{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "1_5.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "TPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g1DmIL1fYN5k",
        "colab_type": "text"
      },
      "source": [
        "### Lesson 5: A Deep-Learning-Based Chatbot"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pT4dsd-tYWbb",
        "colab_type": "text"
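      },
      "source": [
        "### Outline\n",
        "* A smarter chatbot\n",
        "  * 1. Generative models vs. retrieval-based matching models\n",
        "  * 2. Evolving ChatterBot: using deep learning to make it smarter\n",
        "* Building the model\n",
        "  * 1. Analyzing and reformulating the problem\n",
        "  * 2. Dataset and sample construction\n",
        "  * 3. Building the model architecture\n",
        "  * 4. Evaluating the model\n",
        "  * 5. Code implementation and walkthrough"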
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CHYKx_m-Ffyf",
        "colab_type": "text"
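      },
      "source": [
        "### Retrieval-based chatbots\n",
        "* Find the closest question/answer in a knowledge base; this only works if the knowledge base contains a similar question\n",
        "* Given the input and its context, a matching algorithm over the knowledge base selects a suitable reply\n",
        "* The reply is always chosen from a fixed dataset\n",
        "* There are many possible retrieval and matching methods\n",
        "* **The data and the matching method have a large impact on quality**\n",
        "* Pros: grammatically correct replies, fast\n",
        "* Cons: no notion of a conversation; replies do not adapt to the context\n",
        "\n",
        "### Generative chatbots\n",
        "* **Replies need not come from a database; the model can generate them itself**\n",
        "* The typical approach is **seq2seq**\n",
        "* Generated replies must be judged on both fluency and accuracy\n",
        "* Pros: can take the context into account\n",
        "* Cons: the output is not guaranteed to be fluent\n",
        "\n",
        "In practice, use the former as the main approach (it is more controllable) and the latter as a supplement.\n",
        "\n",
        "What role does deep learning play?\n",
        "* Wherever an algorithm is needed, deep learning is worth considering"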
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Njkoxparc0hL",
        "colab_type": "text"
      },
      "source": [
        "### Recap: ChatterBot\n",
        "\n",
        "Bot response logic --> Logic Adapters\n",
        "* Closest Match Adapter\n",
        "  * Fuzzy string matching (edit distance)\n",
        "* Closest Meaning Adapter\n",
        "  * Synonym-based similarity using NLTK's WordNet\n",
        "* Time Logic Adapter"
      ]
    },
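    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "A minimal sketch of the fuzzy-matching idea behind the Closest Match adapter: compute the Levenshtein edit distance between the input and every stored question, then pick the closest one. This is an illustration only, not ChatterBot's actual implementation; `edit_distance`, `closest_match`, and the toy knowledge base `kb` are hypothetical names.\n",
        "\n",
        "```python\n",
        "def edit_distance(a, b):\n",
        "    # Dynamic-programming Levenshtein distance; insert, delete, substitute each cost 1.\n",
        "    prev = list(range(len(b) + 1))\n",
        "    for i, ca in enumerate(a, start=1):\n",
        "        curr = [i]\n",
        "        for j, cb in enumerate(b, start=1):\n",
        "            cost = 0 if ca == cb else 1\n",
        "            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))\n",
        "        prev = curr\n",
        "    return prev[-1]\n",
        "\n",
        "\n",
        "def closest_match(query, known_questions):\n",
        "    # Return the stored question with the smallest edit distance to the query.\n",
        "    return min(known_questions, key=lambda q: edit_distance(query, q))\n",
        "\n",
        "\n",
        "# Hypothetical toy knowledge base.\n",
        "kb = ['how do i install vim', 'how do i update grub', 'what time is it']\n",
        "best = closest_match('how do i instal vim', kb)\n",
        "```\n",
        "\n",
        "A character-level measure like this tolerates typos, but two differently worded questions with the same meaning can still be far apart."
      ]
    },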
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-przSUH_d2_U",
        "colab_type": "text"
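      },
      "source": [
        "However, these matching strategies are too crude\n",
        "* Edit distance cannot capture deeper semantic information\n",
        "* Keywords + word2vec cannot capture the meaning of a whole sentence\n",
        "* RNN models such as LSTM can capture sequence information\n",
        "* Use deep learning to improve accuracy in the matching stage"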
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5Hin1kleefuB",
        "colab_type": "text"
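      },
      "source": [
        "### How to approach it\n",
        "* Matching is inherently fuzzy\n",
        "  * Recast it as a ranking problem\n",
        "* How to handle the ranking problem\n",
        "  * Recast it as a binary (0/1) classification problem that outputs a probability\n",
        "* Data construction\n",
        "  * Requires positive samples (correct replies) and negative samples (incorrect replies)"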
      ]
    },
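    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The reformulation above can be sketched as follows: a binary classifier assigns each (context, candidate reply) pair a probability of being a match, and the candidates are ranked by that probability. The scorer below is a hypothetical word-overlap stand-in, not the model built in this lesson; `match_probability` and `rank_candidates` are illustrative names.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "\n",
        "def match_probability(context, reply):\n",
        "    # Hypothetical stand-in for a trained binary classifier:\n",
        "    # squash the word overlap between context and reply into (0, 1).\n",
        "    overlap = len(set(context.split()) & set(reply.split()))\n",
        "    return 1.0 / (1.0 + math.exp(-(overlap - 1)))\n",
        "\n",
        "\n",
        "def rank_candidates(context, candidates):\n",
        "    # Sort candidate replies by predicted match probability, highest first.\n",
        "    return sorted(candidates, key=lambda r: match_probability(context, r), reverse=True)\n",
        "\n",
        "\n",
        "context = 'grub install fails on xfs'\n",
        "candidates = ['thanks', 'try grub install on ext3 instead', 'what distro do you need']\n",
        "ranked = rank_candidates(context, candidates)\n",
        "```\n",
        "\n",
        "Swapping the stand-in for a learned model leaves the ranking step unchanged, which is why the ranking problem reduces to probabilistic binary classification."
      ]
    },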
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "atFXBzp4e2-W",
        "colab_type": "text"
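      },
      "source": [
        "### About the data\n",
        "\n",
        "* Ubuntu Dialogue Corpus: training set\n",
        "\n",
        "  * Built from conversation logs on Ubuntu's IRC network\n",
        "  * The training set has 1,000,000 examples: about half positive (label 1) and about half negative (label 0; negatives are generated randomly)\n",
        "  * Each sample consists of a context (the Query) and a candidate reply (the Response); **label 1 means the Response matches the Query, label 0 means it does not**.\n",
        "  * Queries average 86 words in length; responses average 17 words\n",
        "\n",
        "* Validation/test sets:\n",
        "  * Each sample has one positive reply and nine negative replies (also called distractors)\n",
        "  * The modeling goal is to score the positive reply as high as possible and the negative replies as low as possible (somewhat like a classification task)\n",
        "  * The corpus has been tokenized, stemmed, and lemmatized\n",
        "  * Named entity recognition (NER) was used to replace entities such as names, locations, organizations, and URLs with special tokens"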
      ]
    },
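    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "One common way to evaluate a setup with one positive reply and nine distractors is recall@k: the fraction of samples whose true reply lands in the model's top k candidates. The sketch below assumes that metric and uses made-up scores; it is not necessarily the evaluation used later in this notebook, and `recall_at_k` is a hypothetical helper.\n",
        "\n",
        "```python\n",
        "def recall_at_k(ranked_indices, k, positive_index=0):\n",
        "    # ranked_indices: one list per sample, candidate indices sorted best-first.\n",
        "    # A sample counts as a hit if the positive candidate is within the top k.\n",
        "    hits = sum(1 for ranking in ranked_indices if positive_index in ranking[:k])\n",
        "    return hits / len(ranked_indices)\n",
        "\n",
        "\n",
        "# Made-up model scores for 2 samples x 10 candidates (index 0 is the true reply).\n",
        "scores = [\n",
        "    [0.9, 0.1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.1, 0.3, 0.2],\n",
        "    [0.3, 0.8, 0.5, 0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2],\n",
        "]\n",
        "rankings = [sorted(range(len(s)), key=lambda i: s[i], reverse=True) for s in scores]\n",
        "r1 = recall_at_k(rankings, 1)\n",
        "r5 = recall_at_k(rankings, 5)\n",
        "```\n",
        "\n",
        "With ten candidates, recall@10 is always 1, so only smaller values of k are informative."
      ]
    },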
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IckhX59eO72-",
        "colab_type": "text"
      },
      "source": [
        "### Exploring the dataset"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hroOUEbLnSwf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "%matplotlib inline\n",
        "\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import matplotlib\n",
        "matplotlib.style.use('ggplot')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "E30Urjf0nS6t",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Load Data\n",
        "import csv\n",
        "train_df = pd.read_csv(\"/content/drive/My Drive/chatBot/train (1).csv\")\n",
        "train_df.Label = train_df.Label.astype('category')\n",
        "\n",
        "test_df = pd.read_csv(\"/content/drive/My Drive/chatBot/test.csv\")\n",
        "validation_df = pd.read_csv(\"/content/drive/My Drive/chatBot/valid.csv\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GoyjaMVHnS4H",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 173
        },
        "outputId": "a386ab43-c99b-4532-f34a-6bcbd8917d3f"
      },
      "source": [
        "train_df.describe()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Context</th>\n",
              "      <th>Utterance</th>\n",
              "      <th>Label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>count</th>\n",
              "      <td>1000000</td>\n",
              "      <td>1000000</td>\n",
              "      <td>1000000.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>unique</th>\n",
              "      <td>957130</td>\n",
              "      <td>744457</td>\n",
              "      <td>2.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>top</th>\n",
              "      <td>!ops __eou__ __eot__ ? __eou__ __eot__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>freq</th>\n",
              "      <td>13</td>\n",
              "      <td>11024</td>\n",
              "      <td>500127.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                        Context       Utterance      Label\n",
              "count                                   1000000         1000000  1000000.0\n",
              "unique                                   957130          744457        2.0\n",
              "top     !ops __eou__ __eot__ ? __eou__ __eot__   thanks __eou__        0.0\n",
              "freq                                         13           11024   500127.0"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "M2dQyguVnS17",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 299
        },
        "outputId": "bafd64bb-16de-434d-a782-5d4450b4a1e2"
      },
      "source": [
        "train_df.Label.hist()\n",
        "plt.title(\"Training Label Distribution\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Text(0.5, 1.0, 'Training Label Distribution')"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 5
        },
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEJCAYAAABhbdtlAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAapUlEQVR4nO3df7RdZX3n8fdjrlI7FaHcmhKCxTXGTsFlVSjQ6rQUKgbHRahtv4qDRsuQcZBWB9spdlyDC/sDa0eGzlimASqhqxW/tWVMK5hSxHEsRfFnq9JqRFqSIDGA+AMrDd3zx36Ch8t97j333nPPSbjv11pn3bOf/eN5nnOS/Tn72fvsU7quQ5Kk2Txu0g2QJO2/DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEhqZUspJpZSulLJ2get1pZSzlqtdo1RKeXMpZfv+sp2B7X2wlHLFqLY3Y9uPaOuo2z5LfVeVUv5yubavhTEkVqC6U57rccciN30zcDiwa4HrHQ68Z5F1LsgBFkgfHHhPHiyl3F1KubGU8ppSyuNnLP4S4Pwht7u2bvOkIZvy28CJC2j6UEopZ5VSZvui1uuAnxt1fVqcqUk3QBNx+MDzHwP+BHgucFcte2hw4VLKE7que3C+jdZlvrzQxnRdt+B1VpA/At4ArAJWA6cAvwG8opTygq7rHgDouu7eUVdcSnkcULqu+wbwjVFvv6XruvvHVZfm55HECtR13Zf3PYB9O5evDJTtLqX8Yinlj0op9wN/AFBK+fVSym2llAdKKXeWUv53KeXJ+7Y7c7hpYPoFpZQP1fU+V0o5bbA9Mz/d1+lzSyl/UEr5eillRynljTPWOayU8sellG/WT9hvKaVsWcowReldXkr5YinlW6WU20spv1FKOWiWZV9e5/9TKeWGUspRM+a/oJTyV3U7O0sp7yylHLaIZn2rvi87u677RNd1bwNOAo4HfnmgvkcMN5VSnl/r/3p9fLqU8sI6+87696bBI8d9w0illJeWUv4OeBB4Rmt4aa7XYLZ1apu6UspR9Shm37+rfUdLV9XpRww31ffll2pdD9b35/Uztn1HKeWiUsqlpZR767+JS0opfhBeIkNCLRfSDx89F3hTLfsWsAk4GngV/c7qd4bY1m/Tf/r9YeAjwLtLKYcOUf+HgGcDvwn8RinllIH576zbezFwMrAWOGOItsylALuBlwM/BLweeDXwqzOWOxw4Fwjg3wIHA39aSikApZSTgfcC1wDPqu06anCZpei67m+A99MYkqk7xq30r/Vz6+PNwAN1kefWvz9T+/IjA6uvqX3bSP8+72g0Y87XYAg3A+cNbOtw+mGm2ZwLvAW4GDgGeBtwcSnl7BnL/QL90fAJ9fl5tR9aiq7rfKzgB/2OvgPWDpR1wJVDrPvTwLeBx822rYHplwyss7qWvXBGfWfNmP6dGXXdBvxmfb6uLnPKwPzH039C/st52vyIuobo438GvjAw/ea6jacPlD1jsD3AB4GLZ2znqXWZZw9sZ/s8dX8QuKIx72LggdmWBQ6tdZ3UWHftbPNrm/4FeOos5dtnTM/3Gjyqf8Dz6zJH1emz+l3Qo9p31eD7WN/X35qxzCXA7QPTdwBbZyxzPfCucf+feqw9PJJQy0dnFpRSXlKHjXaVUr4B/CHwBOD759nWp/Y96brubvpzHquHXafaNbDO0fXvLQPb/WfgY/Nsc16llHNKKR+pwxXfoD+K+YEZi32l67qHh1K6rvs8sIf+Uy70n8xfX0r5xr4H8Lk6b91S27ivqfQ73Efpuu4+4ApgWynl+lLKBaWUHxxyu3d3XfePQyw332swEqWUg+lD7UMzZv1f4KhSyncPlM31b0aLZEio5ZuDE6WUE4A/pv/P+tP0QxavqbOfMM+2ZjvpPd+/vZnrdLOsM9JbGJdSfg54B/Bu4EXAc4CL6I9SFuJxwFvph8oGH+
voP92OwjHA7a2ZXdedAxwL3AD8BPCZUsp/HGK735x/kaH8C32QDVro67hQw/yb0QL5AmpYzwf2dF33pq7rPlI/OS7o+xAjtO9T+Y/uK6jj8Mcucbs/Dnyy67q3d1338a7rvkB/LmGm7yul/OuBup8BTA+062PAMV3XbZ/lseSrhEopzwJeSB/aTV3Xfab25TTgSvrzSfCdnemqJTRjvtdgN/CUUspgHc/lkR6s6zbb0XXd1+jPi/z4jFk/AXypq1d3afl45l/D+nv6HcPZwE30oXHuJBrSdd0XSil/Bryjfjr+Cv1logcz3NHFU0spz55Rtou+j2eXUjYAn6E/Kf6SWdZ/AHhnKWXf9xL+J/1Qx411+r8Bf1FKeTtwNfB1+qOInwPO67ruW8P1FIAnllK+n36H/hTgp4A30g8H/vZsK5RSng6cA/wZ/Xj+GvqTy5+oi+yhv6T11FLKZ4Fv1yGqhZjvNbgJ+G7golLK79MHxGtnbONL9e/ppZQP01/JNVuI/ibw30spX6A/93Iy8J9m2Z6WgUcSGkrXdX8O/Dr9VUp/C7yMgUswJ+DV9Dvy6+l3HDvph1b+aYh1fx345IzHzwO/R39Z5jtr2Qn0J2BnugvYTP8FwA/T7zBf0tWzpV3X3US/I3sW8P+Av6E/0fp14J8X2M+X1/ruALYB6+mvtjppjk/R36QPpWuAz9N/D+bhq4m6rvsX+h1s0H9K/+QC2wTzvwZ/Tx9UZ9K/Tz/PjKvEuq67FbiU/nXfDfyvRl2X0Qfvr9IfqfwKcEHXdVcuot1aoFLfU+mAVocs/o7+Cpc3TLo90mOFw006IJVSfpx++OWTwJPoL1U9iv7ySUkjYkjoQLWK/kt+T6cfwvkM8JNd1/3tRFslPcY43CRJavLEtSSp6bE43OShkSQtzqPuvfVYDAl27Vrozxn0pqen2bNnz4hbs3+zzyuDfX7sW2p/16xZM2u5w02SpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTUNdAhsRd9DfwfIhYG9mHhcR30v/4yxH0d+hMjLzvogo9Hd2fBH9nSFflZmfqNvZyHd+L/nXMnNLLT+W/p47TwSuA16XmV2rjiX1WJI0tIUcSfxkZj47M4+r0xcAN2bmOvp7yF9Qy0+jv03xOvofObkMoO7wL6S//fLxwIURcWhd5zL62wrvW2/9PHVIksZgKcNNG4At9fkW4IyB8qszs8vMW4BDIuJw+l/SuiEz761HAzcA6+u8gzPzlszs6H+k5Yx56pAkjcGw37jugL+IiA74vczcDKzOzLvq/C/znR8cP4L+17D22VHL5irfMUs5c9TxCBGxifrTjJnJ9PT0kN16pLt/+scWtd5Srb725onUCzA1NbXo1+tAZZ9Xhkn1eVL7kak/++iy9HfYkHh+Zu6MiKcAN0TE3w3OrOcPlvWeSXPVUUNrc53sDrSv4k+yvSvt1gVgn1eKldbnvXv3Tu62HJm5s/7dDVxLf07h7jpURP27uy6+EzhyYPW1tWyu8rWzlDNHHZKkMZg3JCLiX0XEk/Y9B06l/4GXrcDGuthG4L31+VbglRFRIuJE4P46ZLQNODUiDq0nrE8FttV5X4uIE+uVUa+csa3Z6pAkjcEwRxKrgQ9HxKeBjwLvy8z3AxcDL4iILwA/Vaehv4T1dmA7cDlwLkBm3gu8Bbi1Pi6qZdRlrqjrfJH+x+2Zow5J0hg8Fn+ZrlvsrcIfOuf0ETdlOKsu3zqRemHljduCfV4pJtXnSe1HVl978yjOSTzq9yT8xrUkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTVPDLhgRq4CPATsz88UR8TTgGu
Aw4OPAKzLzwYg4CLgaOBa4B3hpZt5Rt/FG4GzgIeAXM3NbLV8PXAqsAq7IzItr+ax1LLnXkqShLORI4nXAbQPTbwUuycynA/fR7/ypf++r5ZfU5YiIo4GXAccA64HfjYhVNXzeAZwGHA2cWZedqw5J0hgMFRIRsRb4d8AVdboAJwPvqYtsAc6ozzfUaer8U+ryG4BrMvPbmfklYDtwfH1sz8zb61HCNcCGeeqQJI3BsMNN/wP4L8CT6vRhwFczc2+d3gEcUZ8fAdwJkJl7I+L+uvwRwC0D2xxc584Z5SfMU8cjRMQmYFOtk+np6SG79Uh3L2qtpVtse0dhampqovVPgn1eGSbV50ntR5arv/OGRES8GNidmR+PiJNG3oIRyMzNwOY62e3Zs2eSzVmwSbZ3enp6ovVPgn1eGVZan/fu3buk/q5Zs2bW8mGGm54HnB4Rd9APBZ1Mf5L5kIjYFzJrgZ31+U7gSIA6/8n0J7AfLp+xTqv8njnqkCSNwbwhkZlvzMy1mXkU/YnnD2TmvwduAn62LrYReG99vrVOU+d/IDO7Wv6yiDioXrW0DvgocCuwLiKeFhFPqHVsreu06pAkjcFSvifxK8D5EbGd/vzBlbX8SuCwWn4+cAFAZn4WSOBzwPuB12bmQ/Wcw3nANvqrp7IuO1cdkqQxKF3XTboNo9bt2rVrUSs+dM7pI27KcFZdvnUi9cLKG7cF+7xSTKrPk9qPrL725lGckygzy/3GtSSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNU/MtEBHfBXwIOKgu/57MvDAingZcAxwGfBx4RWY+GBEHAVcDxwL3AC/NzDvqtt4InA08BPxiZm6r5euBS4FVwBWZeXEtn7WOEfVdkjSPYY4kvg2cnJk/DDwbWB8RJwJvBS7JzKcD99Hv/Kl/76vll9TliIijgZcBxwDrgd+NiFURsQp4B3AacDRwZl2WOeqQJI3BvCGRmV1mfqNOPr4+OuBk4D21fAtwRn2+oU5T558SEaWWX5OZ387MLwHbgePrY3tm3l6PEq4BNtR1WnVIksZgqHMS9RP/p4DdwA3AF4GvZubeusgO4Ij6/AjgToA6/3764aKHy2es0yo/bI46JEljMO85CYDMfAh4dkQcAlwL/JtlbdUCRcQmYBNAZjI9Pb2o7dw9ykYtwGLbOwpTU1MTrX8S7PPKMKk+T2o/slz9HSok9snMr0bETcCPAodExFT9pL8W2FkX2wkcCeyIiCngyfQnsPeV7zO4zmzl98xRx8x2bQY218luz549C+nWxE2yvdPT0xOtfxLs88qw0vq8d+/eJfV3zZo1s5bPO9wUEd9XjyCIiCcCLwBuA24CfrYuthF4b32+tU5T538gM7ta/rKIOKhetbQO+ChwK7AuIp4WEU+gP7m9ta7TqkOSNAbDnJM4HLgpIv6Gfod+Q2b+OfArwPkRsZ3+/MGVdfkrgcNq+fnABQCZ+Vkggc8B7wdem5kP1aOE84Bt9OGTdVnmqEOSNAal67pJt2HUul27di1qxYfOOX3ETRnOqsu3TqReWHmH5GCfV4pJ9XlS+5HV1948iuGmMrPcb1xLkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKapuZbICKOBK4GVgMdsDkzL42I7wXeDRwF3AFEZt4XEQW4FHgR8A
Dwqsz8RN3WRuBNddO/lplbavmxwFXAE4HrgNdlZteqY8m9liQNZZgjib3AGzLzaOBE4LURcTRwAXBjZq4DbqzTAKcB6+pjE3AZQN3hXwicABwPXBgRh9Z1LgPOGVhvfS1v1SFJGoN5QyIz79p3JJCZXwduA44ANgBb6mJbgDPq8w3A1ZnZZeYtwCERcTjwQuCGzLy3Hg3cAKyv8w7OzFsys6M/ahnc1mx1SJLGYN7hpkERcRTwHOAjwOrMvKvO+jL9cBT0AXLnwGo7atlc5TtmKWeOOma2axP9UQuZyfT09EK69bC7F7XW0i22vaMwNTU10fonwT6vDJPq86T2I8vV36FDIiK+B/gT4PWZ+bWIeHhePX/Qjbx1A+aqIzM3A5vrZLdnz57lbMrITbK909PTE61/EuzzyrDS+rx3794l9XfNmjWzlg91dVNEPJ4+IP4wM/+0Ft9dh4qof3fX8p3AkQOrr61lc5WvnaV8rjokSWMwb0jUq5WuBG7LzLcPzNoKbKzPNwLvHSh/ZUSUiDgRuL8OGW0DTo2IQ+sJ61OBbXXe1yLixFrXK2dsa7Y6JEljMMxw0/OAVwB/GxGfqmW/ClwMZEScDfwDsG/86Tr6y1+3018C+2qAzLw3It4C3FqXuygz763Pz+U7l8BeXx/MUYckaQxK1y3rqYRJ6Hbt2rWoFR865/QRN2U4qy7fOpF6YeWN24J9Xikm1edJ7UdWX3vzKM5JlJnlfuNaktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpKap+RaIiN8HXgzszsxn1rLvBd4NHAXcAURm3hcRBbgUeBHwAPCqzPxEXWcj8Ka62V/LzC21/FjgKuCJwHXA6zKza9Wx5B5LkoY2zJHEVcD6GWUXADdm5jrgxjoNcBqwrj42AZfBw6FyIXACcDxwYUQcWte5DDhnYL3189QhSRqTeUMiMz8E3DujeAOwpT7fApwxUH51ZnaZeQtwSEQcDrwQuCEz761HAzcA6+u8gzPzlszsgKtnbGu2OiRJYzLvcFPD6sy8qz7/MrC6Pj8CuHNguR21bK7yHbOUz1XHo0TEJvojFzKT6enphfYHgLsXtdbSLba9ozA1NTXR+ifBPq8Mk+rzpPYjy9XfxYbEw+r5g24UjVlsHZm5GdhcJ7s9e/YsZ3NGbpLtnZ6enmj9k2CfV4aV1ue9e/cuqb9r1qyZtXyxVzfdXYeKqH931/KdwJEDy62tZXOVr52lfK46JEljstiQ2ApsrM83Au8dKH9lRJSIOBG4vw4ZbQNOjYhD6wnrU4Ftdd7XIuLEemXUK2dsa7Y6JEljMswlsO8CTgKmI2IH/VVKFwMZEWcD/wBEXfw6+stft9NfAvtqgMy8NyLeAtxal7soM/edDD+X71wCe319MEcdkqQxKV23rKcTJqHbtWvXolZ86JzTR9yU4ay6fOtE6oWVN24L9nmlmFSfJ7UfWX3tzaM4J1FmlvuNa0lSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUtPUpBswn4hYD1wKrAKuyMyLJ9wkSVox9usjiYhYBbwDOA04GjgzIo6ebKskaeXYr0MCOB7Ynpm3Z+aDwDXAhgm3SZJWjP19uOkI4M6B6R3ACTMXiohNwCaAzGTNmjWLq+19H1vcege4Rb9eBzD7vDJMpM8T3I8sR3
/39yOJoWTm5sw8LjOPA8piHxHx8aWsfyA+7PPKeNjnx/5jRP19lP09JHYCRw5Mr61lkqQx2N+Hm24F1kXE0+jD4WXAyyfbJElaOfbrI4nM3AucB2wDbuuL8rPLWOXmZdz2/so+rwz2+bFvWfpbuq5bju1Kkh4D9usjCUnSZBkSkqSm/f3E9bKY71YfEXEQcDVwLHAP8NLMvGPc7RylIfp8PvAfgL3AV4Cfz8x/GHtDR2jYW7pExM8A7wF+JDMP2C/LDNPfiAjgzUAHfDozD+gLQYb4d/1UYAtwSF3mgsy8buwNHaGI+H3gxcDuzHzmLPML/WvyIuAB4FWZ+YnF1rfijiSGvNXH2cB9mfl04BLgreNt5WgN2edPAsdl5rPod5i/Nd5Wjtawt3SJiCcBrwM+Mt4WjtYw/Y2IdcAbgedl5jHA68fe0BEa8j1+E/0FL8+hvzryd8fbymVxFbB+jvmnAevqYxNw2VIqW3EhwXC3+thA/+kD+h3mKTWdD1Tz9jkzb8rMB+rkLfTfSTmQDXtLl7fQfwj4p3E2bhkM099zgHdk5n0Ambl7zG0ctWH63AEH1+dPBnaNsX3LIjM/BNw7xyIbgKszs8vMW4BDIuLwxda3EkNitlt9HNFapl6Gez9w2FhatzyG6fOgs4Hrl7VFy2/ePkfEc4EjM/N942zYMhnmPX4G8IyI+KuIuKUO1RzIhunzm4GzImIHcB3wC+Np2kQt9P/7nFZiSGgOEXEWcBzwtkm3ZTlFxOOAtwNvmHRbxmiKfgjiJOBM4PKIOGSiLVp+ZwJXZeZa+jH6P6jvvYa0El+sYW718fAyETFFf5h6z1hatzyGur1JRPwU8F+B0zPz22Nq23KZr89PAp4JfDAi7gBOBLZGxHFja+FoDfMe7wC2ZuY/Z+aXgM/Th8aBapg+nw0kQGb+NfBdwPRYWjc5I72d0Uq8ummYW31sBTYCfw38LPCBzDyQv3U4b58j4jnA7wHrHwNj1TBPnzPzfgZ2FhHxQeCXDuCrm4b5d/1/6D9ZvzMipumHn24faytHa5g+/yNwCnBVRPwQfUh8ZaytHL+twHkRcQ39XbPvz8y7FruxFXck0brVR0RcFBGn18WuBA6LiO3A+cAFk2ntaAzZ57cB3wP8cUR8KiK2Tqi5IzFknx8zhuzvNuCeiPgccBPwy5l5wB4hD9nnNwDnRMSngXfRXw56IH/gIyLeRf8B9gcjYkdEnB0Rr4mI19RFrqMP/+3A5cC5S6nP23JIkppW3JGEJGl4hoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlS0/8HiYrJ81ThVHQAAAAASUVORK5CYII=\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": [],
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QeEPZYI7nSzx",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 629
        },
        "outputId": "8b7371c7-0d06-4ad8-f48e-3595553e65e8"
      },
      "source": [
        "pd.options.display.max_colwidth = 500\n",
        "train_df.head()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Context</th>\n",
              "      <th>Utterance</th>\n",
              "      <th>Label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>i think we could import the old comments via rsync, but from there we need to go via email. I think it is easier than caching the status on each bug and than import bits here and there __eou__ __eot__ it would be very easy to keep a hash db of message-ids  __eou__ sounds good __eou__ __eot__ ok __eou__ perhaps we can ship an ad-hoc apt_prefereces __eou__ __eot__ version? __eou__ __eot__ thanks __eou__ __eot__ not yet __eou__ it is covered by your insurance? __eou__ __eot__ yes __eou__ but it...</td>\n",
              "      <td>basically each xfree86 upload will NOT force users to upgrade 100Mb of fonts for nothing __eou__ no something i did in my spare time. __eou__</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>I'm not suggesting all - only the ones you modify. __eou__ __eot__ ok, it sounds like you're agreeing with me, then __eou__ though rather than \"the ones we modify\", my idea is \"the ones we need to merge\" __eou__ __eot__</td>\n",
              "      <td>sorry __eou__ i thought it was ubuntu related. __eou__</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>afternoon all __eou__ not entirely related to warty, but if grub-install takes 5 minutes to install, is this a sign that i should just retry the install :) __eou__ __eot__ here  __eou__ __eot__ you might want to know that thinice in warty is buggy compared to that in sid __eou__ __eot__ and apparently GNOME is suddently almost perfect (out of the thinice problem), nobody report bugs :-P __eou__ I don't get your question, where do you want to paste ? __eou__ __eot__ can i file the panel not l...</td>\n",
              "      <td>Yep. __eou__ oh, okay. I wondered what happened to you __eou__ what distro do you need? __eou__ yes __eou__</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>interesting __eou__ grub-install worked with / being ext3, failed when it was xfs __eou__ i thought d-i installed the relevant kernel for your machine. i have a p4 and its installed the 386 kernel __eou__ holy crap a lot of stuff gets installed by default :) __eou__ YOU ARE INSTALLING VIM ON A BOX OF MINE __eou__ ;) __eou__ __eot__ more like osx than debian ;) __eou__ we have a selection of python modules available for great justice (and python development) __eou__ __eot__ 2.8 is fixing them...</td>\n",
              "      <td>thats the one __eou__</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>and because Python gives Mark a woody __eou__ __eot__ i'm not sure if we're meant to talk about that publically yet. __eou__ __eot__ and I thought we were a \"pants off\" kind of company ... :p __eou__ you need new glasses __eou__ __eot__ mono 1.0? dude, that's going to be a barrel of laughs for totally non-release related reasons during hoary __eou__ read bryan clark's entry about NetworkManager? __eou__ __eot__ there was an accompanying IRC conversation to that one &lt;g&gt; __eou__ explain ? __eo...</td>\n",
              "      <td>(i thought someone was going to make a joke about .au bandwidth...) __eou__ especially not if you're using screen ;) __eou__</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Context  ... Label\n",
              "0  i think we could import the old comments via rsync, but from there we need to go via email. I think it is easier than caching the status on each bug and than import bits here and there __eou__ __eot__ it would be very easy to keep a hash db of message-ids  __eou__ sounds good __eou__ __eot__ ok __eou__ perhaps we can ship an ad-hoc apt_prefereces __eou__ __eot__ version? __eou__ __eot__ thanks __eou__ __eot__ not yet __eou__ it is covered by your insurance? __eou__ __eot__ yes __eou__ but it...  ...   1.0\n",
              "1                                                                                                                                                                                                                                                                                         I'm not suggesting all - only the ones you modify. __eou__ __eot__ ok, it sounds like you're agreeing with me, then __eou__ though rather than \"the ones we modify\", my idea is \"the ones we need to merge\" __eou__ __eot__   ...   0.0\n",
              "2  afternoon all __eou__ not entirely related to warty, but if grub-install takes 5 minutes to install, is this a sign that i should just retry the install :) __eou__ __eot__ here  __eou__ __eot__ you might want to know that thinice in warty is buggy compared to that in sid __eou__ __eot__ and apparently GNOME is suddently almost perfect (out of the thinice problem), nobody report bugs :-P __eou__ I don't get your question, where do you want to paste ? __eou__ __eot__ can i file the panel not l...  ...   0.0\n",
              "3  interesting __eou__ grub-install worked with / being ext3, failed when it was xfs __eou__ i thought d-i installed the relevant kernel for your machine. i have a p4 and its installed the 386 kernel __eou__ holy crap a lot of stuff gets installed by default :) __eou__ YOU ARE INSTALLING VIM ON A BOX OF MINE __eou__ ;) __eou__ __eot__ more like osx than debian ;) __eou__ we have a selection of python modules available for great justice (and python development) __eou__ __eot__ 2.8 is fixing them...  ...   1.0\n",
              "4  and because Python gives Mark a woody __eou__ __eot__ i'm not sure if we're meant to talk about that publically yet. __eou__ __eot__ and I thought we were a \"pants off\" kind of company ... :p __eou__ you need new glasses __eou__ __eot__ mono 1.0? dude, that's going to be a barrel of laughs for totally non-release related reasons during hoary __eou__ read bryan clark's entry about NetworkManager? __eou__ __eot__ there was an accompanying IRC conversation to that one <g> __eou__ explain ? __eo...  ...   1.0\n",
              "\n",
              "[5 rows x 3 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-8iV55JmnSuW",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 435
        },
        "outputId": "df8b12ae-be76-4e5a-bec5-c68fa656afce"
      },
      "source": [
        "plt.figure(1)\n",
        "train_df_context_len = train_df.Context.str.split(\" \").apply(len)\n",
        "train_df_context_len.hist(bins=40)\n",
        "# Distribution of context lengths in the training samples; most fall between 0 and 100 tokens\n",
        "plt.title(\"Training Context Length Statistics\")\n",
        "print(train_df_context_len.describe())"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "count    1000000.000000\n",
            "mean          77.877830\n",
            "std           66.837906\n",
            "min            7.000000\n",
            "25%           34.000000\n",
            "50%           57.000000\n",
            "75%           97.000000\n",
            "max         2034.000000\n",
            "Name: Context, dtype: float64\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEJCAYAAABhbdtlAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAf1UlEQVR4nO3de9QcVZnv8e82LyBLxQSaieQyJygRDV64CeEiC1QgMEjAcR6DShKGQ2SEJR5YR0FzhAM4gzOOyKwDjOFySMYLPF4YooIxgzB41jGCIF4A0QDhkAQSQ0IAo8GEOn/s/UKl07vf7n7f7q6Q32etXm/Xrqq9n6p6u56u2ru7Q1EUiIiINPKqfgcgIiLVpSQhIiJZShIiIpKlJCEiIllKEiIikqUkISIiWUoSr3AhhCNDCEUIYUKb6xUhhI92Ky6pthDCshDC3H7H0YqRiDWEcFEIYelIxfRKoiRREemk3OyxrMOq/y+wB7CyzfX2AL7VYZttCyEcEkK4OYSwKoTwpxDCIyGEr4YQ9h/hdg5P+3PSSNab6p7bynEKIUxKMRw+0jG0q9WY26jv7SGE74QQngwhbAwhrAghfC+EsF9pmU0hhNkd1H1tCOHOBrPeBVzeYh254/9FYGq7MW0PlCSqY4/S469T2f6lsneVFw4h7NhKpUVRvFAUxVNFUbzYTjBpnT+1s06nQginAT8GXgA+ArwV+BCwDLiiFzHI8IUQdgd+BGwCTgTeDBhwL7Brt9otiuL3RVH8YZh1PF8UxZqRiukVpSgKPSr2AI4ECmBCqawAPgF8HVgP3JTKPw88BGwAngD+FXh9rq7S9NHAXWm9B4Hj6mIogI/WTX8c+DfgOWA5cEHdOrsB3wT+AKwCLgHmA//RZFvHAX8C/jUzf0zp+d7A94Hn0+O7wF6l+bOJJ6jDgPvStt0LvCvNn5S2o/y4s7T+DOD+FM8y4EvAa9K8Q4E/AyeXlj8qlR2b2q6v+6LMNg3GcXiT/ZKNJc2/E7gW+B/AU8BaYAHw2tIyrwL+Hvh92l83Ap8ENpX2V8OYU5sXE5P02nQ8LwcGmsR8UqrjtU2WWVbf5uBxBr4K/D/gj8DDwHlASPMvahDr7FKdc0ttTAd+no7/M8DdwH7Njn+qf2ldrO8jvnnZQHzN/SfwpjRvH2BRqv8PxNfgqf0+d3TlfNTvAPRocFDySeJp4GzgTcDkVD4XeHd6AbwX+A0wP1dXafoXwDRgMvC/gWfZ8oTcKEmsAs5I7Z+Vyt5bWmYh8FviyXOfVO96mieJT9Zva2a5nYHHgduBA9LjDmApsGNaZjbwIjH5vRt4C3Ab8BgwAIwivsMtiFdmbwB2La27DjgVeCNwBPBL4N9KMXyWeML8S2B3YAXwj6X4LiMm6jekR8OTJUMkiRZjuTOdoC5P23lMiu2S0jLnEpPDqek4n5uW2TRUzMQT7zrg/LSuERPi6U2O0cFpu/4r8KrMMrsTE/k5g22m8jektvYH9gQ+mmI/Lc1/LfA14u3TwVh3LsU6t1TPC8CnUj1vBT4MvH2I438RpSRBTBCbgS8D70z7+HTgLWn+L4lv2KakY3QccEK/zx1dOR/1OwA9GhyUfJK4roV1TwY2Dr5I6+sqTX+gtM7YVHZsXXv1SeJf6tp6CPiH9HwyWyeNHdIJqFmSuApY38J2nU58R1eri/uPwMw0PTvFsH9pmcET195p+vA0Pamu/mXAmXVlR6Rlx6TpVwH/QXx3eSvxHeoOpeXnAsta2JZJNE8SrcRyJ/CLumWuBn5Sml5BKWmkshtJSaJZzCmGhXVltwHfGGLbLiaepJ8lJvGLgLfWLbOJdBUwRF1XAItL09dSuvKri3UwSezX6PiWls0d/4vYMkn8GPhek9jWt7INr4SH+iS2LXfXF4QQPhBCuCuEsDKE8Dzx3daOxHdJzdw/+KQoilXEd01jW10nWVlaZ0r6u6
RU75+Bnw1RZxhi/qB9gAeL0n3jFPfDad5LxcSrpHKM0GTb0r30/wJ8KYTw/OCDeFIE2Cu19yLxXfnbiSftGWkbR0yrsSS/qFv9peMRQng98VbekrplftJGOM2Od0NFUXwuLTM7tf3XwC9DCB9utl4I4VUhhPNDCPeHENakbT6TuC/a8UvibaBfp4EQ54QQJrZZB8Qr1R82mf9F4NoQwp1pZNSIDrCoEiWJbcsWnXMhhIOJfQB3Ea8g9ie+sCAmimZeaFA21P9D/TpFg3WKIeqo9zCwS7tDdJt4sSiKzQ3iabZtg/POAfYtPd5JvEL6VWnZfYHXAK8GOjn5DKWdWLpxPMpaqX8rRVGsK4riO0VRXAC8g3hF8fkhVjsPuAD4F2J/2b7EK4eWBmiU2t5MvPXzHuAeYpL6bQjhhHbqaaGdS4gd8w68DVgSQrh0JNuoCiWJbdvhwJqiKOYWRfHToih+C4zUybZdD6a/hwwWhBAGiO/Imvkm8fZYw3HuIYQx6ekDwJQQQq00byyxM/vXbcQ5eOIbNViQrkieIN6SWtrg8afU3huIHfGfB/4X8NUQwq51dY9iGFqNpYV61hPf+R9SN6t+mOewYx4ijoL4RuAvhmjzCOAHRVFcXxTFz4uiWEpMim3HWkR3F0Xx90VRHEHscD6tVAct1HMvsZ+nWTuPFkVxVVEUHwQ+B/zdULFtiwb6HYAMy8PA7iGE04nv1g4njkDquaIofhdC+C5wZQjhY8QRNecBu9Dk3WxRFCtCCGcDXwkhjAauAR4hDpmcTuwEP4LYSfg54KYQwn8n3qb6IvG++01thPo4sXP7+BDCTcDGdEL9LHBdCGEdcAuxk/atxFFfHwshBOLood8QR22NSnFdTxzVA7GD/A0hhEOA3wEbiqLY0CSWvdJtlbJlQ8XSxrb+M/A/Qwi/Id6q/Cviia98PNqNOSuE8H7gFGK/x8PE/Xwk8LfAzXVtHhVCuA14Id1CfBg4NYRwFPGYziT2J62rW+9vQgj7EAdRPFcUxca6GA4lDuD4IfAkMdG8A7guLZI7/vUuAW4LIXyZeIw3EhPuT1J8XwC+nWIaTRwE8mCDerZ9/e4U0WPrB/mO6482WPYS4gvmD8TO1FModczV19Wo7lS+RWdifXuN2id24t5Qmt6N+AG8DcBqYifmN4HvtrDNhwP/ntbbCDxKPCnvW1pm77SNg0Ngv0eDIbB19U5IsR9ZKvsU8YW+mS2HwJ5EPAlsIHa83g98Ls37NHFk0MTS8m8mDgc+K03vQExma2ltCGyjx4yhYknz7wSurat3i05o4p2CfwDW8PIQ2M8QT640i5m6YaWprGHHcWn+G4md5w+m9p4jXuV9ljQSKS03jTjo4QVeHgL7euKtm2eJo/iuJP5vl7dn13T815MZAkvsn7qVOCx4IzEp/BNpBFzu+NN4COyx6Rj8MbV5R9rGV6d99hhxiPJq4huVibl9sy0/Bscgi4y4EMIo4jvvhUVRnNfveARCCNcD7yyKYqjbgCKAbjfJCAohHEG89/xz4HXAfyO+a76hf1Ftv0II44gDGu4gvmt+P/E2ztn9jEu2LUoSMpJGEW957EW8j/5r4KiiKH7VdC3pls3A3xBv27ya+MHDvyuK4pq+RiXbFN1uEhGRLA2BFRGRrFfi7SZdGomIdGarb0B4JSYJVq5s96cToFarsWZN9b4puIpxVTEmqGZcVYwJqhlXFWOCasbVjZjGjRvXsFy3m0REJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREslr6MJ2ZLSN+N/xmYJO7H2hmuxK/Q30S8fvczd3XmVkg/oD58cTvwp/t7velembx8i+QXeru81P5AcRvCt2Z+F3w57h7kWtjWFssIiIta+cT10e5e/kjfucDt7v7ZWZ2fpr+NPH3ZSenx8HEHyE5OJ3wLwQOJH51xr1mtjCd9K8GzgB+SkwS04g//J5roys2n3Fi0/mjrlnYra
ZFRCppOLebphN/75f096RS+QJ3L9x9CTDazPYg/srTYndfmxLDYmBamreLuy9x94L4a2QnDdGGiIj0QKtXEgXwQzMrgK+4+zxgrLs/meY/BYxNz8cTf8h90PJU1qx8eYNymrSxBTObA8wBcHdqtVqLm/WygYGhd0Un9Q7XwMBAX9ptpooxQTXjqmJMUM24qhgTVDOuXsbUapI43N1XmNlfAIvN7Dflman/oKvfvtqsjZS05qXJopMvvmplh/fjS762ly8XGwlVjKuKMUE146piTFDNuCr3BX/uviL9XQ3cDBwErEq3ikh/V6fFVwATS6tPSGXNyic0KKdJGyIi0gNDJgkze42ZvW7wOXAM8WcpFwKz0mKzgFvS84XATDMLZjYVWJ9uGS0CjjGzMWY2JtWzKM171symppFRM+vqatSGiIj0QCtXEmOB/2NmvwDuBr7v7j8ALgOONrPfAe9L0xBHJz1K/D3da4CPA7j7WuJv7d6THhenMtIy16Z1HiGObKJJGyIi0gOvxN+4Ljr90aFVJx/adJl+DIHdXu6HjoQqxlXFmKCacVUxJqhmXF3sk9jql+n0iWsREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkayBVhc0s1HAz4AV7n6Cme0J3AjsBtwLnOruL5jZTsAC4ADgaeBD7r4s1XEBcDqwGfiEuy9K5dOAK4BRwLXuflkqb9jGsLdaRERa0s6VxDnAQ6XpLwCXu/tewDriyZ/0d10qvzwth5lNAWYA+wDTgKvMbFRKPlcCxwFTgFPSss3aEBGRHmgpSZjZBOCvgGvTdADeA3wrLTIfOCk9n56mSfPfm5afDtzo7hvd/TFgKXBQeix190fTVcKNwPQh2hARkR5o9XbTl4FPAa9L07sBz7j7pjS9HBifno8HngBw901mtj4tPx5YUqqzvM4TdeUHD9HGFsxsDjAntUmtVmtxs142MDD0ruik3uEaGBjoS7vNVDEmqGZcVYwJqhlXFWOCasbVy5iGPDOa2QnAane/18yO7H5I7XP3ecC8NFmsWbOm7Tpa2eGd1DtctVqtL+02U8WYoJpxVTEmqGZcVYwJqhlXN2IaN25cw/JWbjcdBpxoZsuIt4LeQ+xkHm1mg0lmArAiPV8BTARI819P7MB+qbxunVz5003aEBGRHhgySbj7Be4+wd0nETuef+TuHwHuAD6YFpsF3JKeL0zTpPk/cvcilc8ws53SqKXJwN3APcBkM9vTzHZMbSxM6+TaEBGRHhjO5yQ+DZxrZkuJ/QfXpfLrgN1S+bnA+QDu/gDgwIPAD4Cz3H1z6nM4G1hEHD3ladlmbYiISA+Eoij6HcNIK1auXNn2SrVajVUnH9p0mVHXLOw0po5tL/dDR0IV46piTFDNuKoYE1Qzri72SYT6cn3iWkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQka2CoBczs1cBdwE5p+W+5+4VmtidwI7AbcC9wqru/YGY7AQuAA4CngQ+5+7JU1wXA6cBm4BPuviiVTwOuAEYB17r7Zam8YRsjtO0iIjKEVq4kNgLvcfd3AvsC08xsKvAF4HJ33wtYRzz5k/6uS+WXp+UwsynADGAfYB
pwlZmNMrNRwJXAccAU4JS0LE3aEBGRHhgySbh74e7Pp8kd0qMA3gN8K5XPB05Kz6enadL895pZSOU3uvtGd38MWAoclB5L3f3RdJVwIzA9rZNrQ0REemDI200A6d3+vcBexHf9jwDPuPumtMhyYHx6Ph54AsDdN5nZeuLtovHAklK15XWeqCs/OK2Ta6M+vjnAnNQmtVqtlc3awsDA0Luik3qHa2BgoC/tNlPFmKCacVUxJqhmXFWMCaoZVy9jailJuPtmYF8zGw3cDLylq1G1yd3nAfPSZLFmzZq262hlh3dS73DVarW+tNtMFWOCasZVxZigmnFVMSaoZlzdiGncuHENy9sa3eTuzwB3AIcAo81sMMlMAFak5yuAiQBp/uuJHdgvldetkyt/ukkbIiLSA0MmCTPbPV1BYGY7A0cDDxGTxQfTYrOAW9LzhWmaNP9H7l6k8hlmtlMatTQZuBu4B5hsZnua2Y7Ezu2FaZ1cGyIi0gOtXEnsAdxhZr8kntAXu/v3gE8D55rZUmL/wXVp+euA3VL5ucD5AO7+AODAg8APgLPcfXPqczgbWERMPp6WpUkbIiLSA6Eoin7HMNKKlStXtr1SrVZj1cmHNl1m1DULO42pY9vL/dCRUMW4qhgTVDOuKsYE1Yyri30Sob68pY5riTafcWJ2Xj8SiIhIt+lrOUREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJEtJQkREspQkREQkS0lCRESylCRERCRLSUJERLKUJEREJGtgqAXMbCKwABgLFMA8d7/CzHYFbgImAcsAc/d1ZhaAK4DjgQ3AbHe/L9U1C5ibqr7U3een8gOAG4CdgVuBc9y9yLUx7K0WEZGWtHIlsQk4z92nAFOBs8xsCnA+cLu7TwZuT9MAxwGT02MOcDVAOuFfCBwMHARcaGZj0jpXA2eU1puWynNtiIhIDwyZJNz9ycErAXd/DngIGA9MB+anxeYDJ6Xn04EF7l64+xJgtJntARwLLHb3telqYDEwLc3bxd2XuHtBvGop19WoDRER6YEhbzeVmdkkYD/gp8BYd38yzXqKeDsKYgJ5orTa8lTWrHx5g3KatFEf1xziVQvuTq1Wa2ezABgYaGtXbKWTNlsxMDDQtbo7VcWYoJpxVTEmqGZcVYwJqhlXL2Nq+cxoZq8Fvg180t2fNbOX5qX+g6IL8bXUhrvPA+alyWLNmjVt1z/cHd5Jm62o1Wpdq7tTVYwJqhlXFWOCasZVxZigmnF1I6Zx48Y1LG9pdJOZ7UBMEF9z9++k4lXpVhHp7+pUvgKYWFp9QiprVj6hQXmzNkREpAeGTBJptNJ1wEPu/qXSrIXArPR8FnBLqXymmQUzmwqsT7eMFgHHmNmY1GF9DLAozXvWzKamtmbW1dWoDRER6YFWbjcdBpwK/MrM7k9lnwEuA9zMTgceBwbvP91KHP66lDgE9jQAd19rZpcA96TlLnb3ten5x3l5COxt6UGTNkREpAdCUXS1K6EfipUrV7a9Uq1WY9XJh3bc6KhrFna8bjPby/3QkVDFuKoYE1QzrirGBNWMq4t9EqG+XJ+4FhGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJUpIQEZEsJQkREclSkhARkSwlCRERyVKSEBGRLCUJERHJGhhqATO7HjgBWO3ub0tluwI3AZOAZYC5+zozC8AVwPHABmC2u9
+X1pkFzE3VXuru81P5AcANwM7ArcA57l7k2hj2FnfJ5jNObDp/1DULexSJiMjIaeVK4gZgWl3Z+cDt7j4ZuD1NAxwHTE6POcDV8FJSuRA4GDgIuNDMxqR1rgbOKK03bYg2RESkR4ZMEu5+F7C2rng6MD89nw+cVCpf4O6Fuy8BRpvZHsCxwGJ3X5uuBhYD09K8Xdx9ibsXwIK6uhq1ISIiPTLk7aaMse7+ZHr+FDA2PR8PPFFabnkqa1a+vEF5sza2YmZziFcuuDu1Wq3d7WFgoNNd0ZpOYoIYV6frdksVY4JqxlXFmKCacVUxJqhmXL2MadhnxtR/UIxEMJ224e7zgHlpslizZk3bbXR7h3cSE8S4Ol23W6oYE1QzrirGBNWMq4oxQTXj6kZM48aNa1je6eimVelWEenv6lS+AphYWm5CKmtWPqFBebM2RESkRzpNEguBWen5LOCWUvlMMwtmNhVYn24ZLQKOMbMxqcP6GGBRmvesmU1NI6Nm1tXVqA0REemRVobAfgM4EqiZ2XLiKKXLADez04HHAUuL30oc/rqUOAT2NAB3X2tmlwD3pOUudvfBzvCP8/IQ2NvSgyZtiIhIj4Si6Gp3Qj8UK1eubHulWq3GqpMP7UI4Uaefk9he7oeOhCrGVcWYoJpxVTEmqGZcXeyTCPXl+sS1iIhkKUmIiEiWkoSIiGQpSYiISJaShIiIZClJiIhIlpKEiIhkKUmIiEiWkoSIiGQpSYiISJaShIiIZClJiIhIlpKEiIhkKUmIiEiWkoSIiGQN+zeupTWbzzgxO6/T35oQEek2XUmIiEiWkoSIiGQpSYiISJaShIiIZClJiIhIlpKEiIhkKUmIiEiWkoSIiGTpw3QV0OyDdqvQh+1EpH90JSEiIllKEiIikqUkISIiWUoSIiKSpY7rbUCzjm1Qx7aIdI+uJEREJEtJQkREsip/u8nMpgFXAKOAa939sj6HVDn6QSMR6ZZKJwkzGwVcCRwNLAfuMbOF7v5gfyPbdqg/Q0SGo9JJAjgIWOrujwKY2Y3AdEBJYoQMlUQaWZX+KsGIvPJVPUmMB54oTS8HDq5fyMzmAHMA3J1x48Z11NjE7/+so/WkOjo99t1UxZigmnFVMSaoZly9iukV0XHt7vPc/UB3PxAInTzM7N5O1+3mo4pxVTGmqsZVxZiqGlcVY6pqXF2MaStVTxIrgIml6QmpTEREeqDqt5vuASab2Z7E5DAD+HB/QxIR2X5U+krC3TcBZwOLgIdikT/Qpebmdane4apiXFWMCaoZVxVjgmrGVcWYoJpx9SymUBRFr9oSEZFtTKWvJEREpL+UJEREJKvqHddd18+v/TCzicACYCxQAPPc/Qozuwg4A/h9WvQz7n5rWucC4HRgM/AJd1/UhbiWAc+lNja5+4FmtitwEzAJWAaYu68zs0Dcf8cDG4DZ7n5fF2LaO7U/6I3A54DR9Hhfmdn1wAnAand/Wypre/+Y2Sxgbqr2UnefP8Ix/RPwfuAF4BHgNHd/xswmEfv4Hk6rL3H3M9M6BwA3ADsDtwLnuHvH96QzcV1Em8dsJF+nmZhuAvZOi4wGnnH3fXu1r5qcC/r6fwXb+ZVE6Ws/jgOmAKeY2ZQehrAJOM/dpwBTgbNK7V/u7vumx+ALaApxhNc+wDTgqrQN3XBUavvANH0+cLu7TwZuT9MQ993k9JgDXN2NYNz94cH9ARxAfGHcnGb3el/dkOosa2v/pBf/hcQPhx4EXGhmY0Y4psXA29z9HcBvgQtK8x4p7bMzS+VXE0/ggzHX1zkScUEbx6wLr9OtYnL3D5X+v74NfKc0uxf7Kncu6Pf/1fadJCh97Ye7vwAMfu1HT7j7k4PZ392fI75jGd9klenAje6+0d0fA5YSt6EXpgOD70jmAyeVyhe4e+HuS4DRZrZHl2N5L/GF+3iTZbq2r9z9LmBtg/ba2T/HAovdfa27ryOe0Ds+yTSKyd1/mEYIAiwhfs4oK8W1i7svSe
+IF5S2Y8TiaiJ3zEb0ddospvQO3YBvNKtjpPdVk3NBX/+vQLebWvraj15Il7X7AT8FDgPONrOZwM+I7zDWEeNdUlptOc2TSqcK4IdmVgBfcfd5wFh3fzLNf4p4WQyN9+F44Em6ZwZbvoj7ua8Gtbt/cuXd8rdsebtuTzP7OfAsMNfdf5zaX96jmNo9Zr16nb4bWOXuvyuV9XRf1Z0L+v5/tb1fSVSCmb2WeIn7SXd/lnjp+CZgX+LJ9p97HNLh7r4/8ZL2LDM7ojwzvXPqy9hpM9sROBH4Zirq977aSj/3TyNm9lni7YyvpaIngb909/2Ac4Gvm9kuPQypcses5BS2fAPS033V4Fzwkn79X23vSaLvX/thZjsQ/ym+5u7fAXD3Ve6+2d1fBK7h5dskPYnX3Vekv6uJ9/0PAlYN3kZKf1f3MqaS44D73H1VirGv+6qk3f3Tk/jMbDaxk/Yjg52q6XbO0+n5vcRO7Ten9su3pLr1/9XuMevVvhoAPkDpiquX+6rRuYAK/F9t70nipa/9SO9QZwA9+/7rdP/zOuAhd/9Sqbx8T/9k4Nfp+UJghpntlL6qZDJw9wjH9Boze93gc+CY1P5CYFZabBZwSymmmWYWzGwqsL50edwNW7zT6+e+qtPu/lkEHGNmY1LH4jGpbMSkEUGfAk509w2l8t0HO/HN7I3EffNoiutZM5ua/jdnlrZjJONq95j16nX6PuA37v7SbaRe7avcuYAK/F9t130S7r7JzAa/9mMUcL1372s/GjkMOBX4lZndn8o+Qxy9sS/x0nIZ8LEU7wNm5sTf09gEnOXum0c4prHAzWYG8f/j6+7+AzO7B3AzOx14nNi5B3Ho3/HETsYNwGkjHM9LUtI6mrQ/kn/s9b4ys28ARwI1M1tOHE1yGW3sH3dfa2aXEE+AABe7e6sdvK3GdAGwE7A4Hc/B4ZtHABeb2Z+BF4EzS21/nJeHdd6WHh3LxHVku8dsJF+njWJy9+vYuq8LerevcueCvv5fgb6WQ0REmtjebzeJiEgTShIiIpKlJCEiIllKEiIikqUkISIiWUoSIiKSpSQhIiJZ/x/vmQYAgTaZiwAAAABJRU5ErkJggg==\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": [],
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nIKOwCbxnSsB",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 435
        },
        "outputId": "1ff56d30-6b37-4789-959c-b949185f9d95"
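      },
      "source": [
        "# Histogram of response (utterance) lengths, measured in space-separated tokens\n",
        "plt.figure(2)\n",
        "train_df_utterance_len = train_df.Utterance.str.split(\" \").apply(len)\n",
        "train_df_utterance_len.hist(bins=40)\n",
        "# Response lengths in the training samples are mostly concentrated between 0 and 20 tokens\n",
        "plt.title(\"Training Utterance Length Statistics\")\n",
        "print(train_df_utterance_len.describe())"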
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "count    1000000.000000\n",
            "mean          15.204147\n",
            "std           14.642695\n",
            "min            2.000000\n",
            "25%            6.000000\n",
            "50%           11.000000\n",
            "75%           20.000000\n",
            "max          584.000000\n",
            "Name: Utterance, dtype: float64\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEJCAYAAACHRBAhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3dfbRcVZnn8e8mV14ahQQK09wkGmyidNDhtSEi7eKlCQnSBBx9wFYSaCROA4qDMwLddMcGpsXp1WJ6jWYMQZP00MJjkEVsgRheHFZPd5AXsREQDRgkuZAQEsKbgol7/tj7QlHWrdq37s29VZXfZ61at84+5+y9n6qTeursfeokxBgRERFpZqfR7oCIiHQGJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYHS6EcEwIIYYQJg5yvxhC+MT26pd0r046doajryGExSGE24erT51MCWOE5AO30WNNi1X/G7Av0DfI/fYFlrXYZrEQwuQc39F11n0hhLC6anlRCOEHdbbbGkI4a/v2tD20+gVgO/Wl7vsxhPqODiF8P4TwbAjh1yGEJ0MIy0II78zrJ+bYj2mh7ttDCIvrrCo+zkMInwgh1Pth2oXARwfbp27UM9od2IHsW/X8KOBG4FDg6Vy2rXrjEMLOMcbXmlWat3lmsJ2JMQ56n04XQtgJCDHGbU03lmEVQvhDYCXwDeC/Ay8Ak4EPAXtsr3aH4ziPMW4Zjr50hRijHiP8AI4BIjCxqiwCnwH+GdgC3JDL/wfwKPAK8BTwv4E9B6qravkE4O683yPAzJo+ROATNcvnAf8EvAisBS6t2Wdv4NvAy8B64ApgCXB7g1gn57qPrrPuC8Dqquex5nEWsKa2vGr/w4DvAy8BzwLfAd5ZWz9wOvBTYCvwh6REfSuwIe97LzCjpm9rgMuB+cCmHO/VQE/Ndufn1/fVXN+NVevekvvwC+DXwMPApwZ7bNTZ5tM5nl8DPwf+qrpfJX0HdgMWko61zcDXgC82ez9Kj5U6ff4s8GyTbWrbW5PL98vvbR/peH4IOLNqv8V19j1mgOP8k6R/T7/Or83dwMSq1736sbiq/ttr+no6cH+u57l8PI3L644G/l9+bV4EfgycONqfO8PxGPUO7IiPeh8Kefk54ALgD4Apufwy4I9JH7zH5w+KJQPVVbX8Y2AGMAX4Jukb3bia9moTxnrg3Nz++bns+KptlgM/A44FDsz1bqn9x1QT62TKEsZbgetIQ2y/nx+7AfuQPugv7C/P208lfdj/LXAA8D5SMvsZsGtV/a8A/xc4Eng38Lb8Gp2VY3g3cCXwGvDuqr6tIX2QXpJfQwN+A5xTtc3f5j5ckOs5FPirqvWLgf8AppM+9E4Hnq+uo+TYqPOaPQmclus8CfglcMUg+/6P+f0+BXgPKVlsafZ+lB4rdfp9en4fZzbY5pBcz4dze/vk8vfl1/ig3N6nc13H5vV7kj74b6jq6861xznpC8ZWYDbwzlzvJ0kJY+eqOPrr2LPqfby9qp9n59fzr0nH4X8iHZ8V0qjNJuDL+bWfkt+rPx7tz51h+ewa7Q7siI96Hwp5+dqCfU8jfZvdqV5dVcsfrtpnfC47saa92oTxjzVtPQp8MT+fUvuhQPoG/RTDkDDy8iLgB3W220r+dltVthi4vqZsF1KCOLWq/t8C7yh4XX/Mmz/s1wDLa7a5FfhWfr478Cvgvw1Q33657QNqyv8GeHAwx0bVut/L8dWeDc0Gnh9k31+lJnEBqwrfj4bHygBx7ZTr+y3pi9FtwMXApKptJlJ1dtDk/boZuKZq+XbyGUGdvvYnjNNISXGPAer8BFVnsDXHWnXC+CXwvwaoY1xpDJ340BxGe/lhbUEI4cOk0/n9SWO9O5G+Df0+jSe6H+x/EmNcH0LYRkocjTxYs9xXtc/U/HdVVb2/CSHcR/rWPtL+CNg/hPBSTf
mupOTWb32M8ZfVG4QQ9iGdHRxHeh178n7vrKmr3uuxX35+YN7n+wP073AgAPeFEKrLe6iZrxqEA0lnXTfWTM6OAXYNIewTY3y2oO/7k46hVTXb/Dvwp4V9aXSs/I4Y42+BT4YQLiOdof4R8Cngr0MIJ8cYfzDQviGE3yMl2j8lzQXuTPpycFdhX/utBJ4AfhFCWAncCXwnxrixtIIQwtuBSQzwvscYN4cQFgErQgh3ks5ub4oxPjbIvrYlJYz28nL1QgjhSNIwyxdJE4WbgWmkeYOdm9RVb8K82VVxtfvEOvtEBqd/wnDPOuvGksaAW7ETaQz9qjrrnqt6/nKd9YuBdwCfJ80v/Aq4nt99TUtej0b9g3SBwyt16mlFf50fJQ291dpU9Xx7vJfVWnptYpqE/hbwrRDCJcCPgHnADxrs9vfALOAi4DHSe/oP1D+mGrX9UgjhcOADwJ8A/wX4nyGE42OM9w+mribtnBtCmE8aijwBuCKEcEGM8evD1cZo0WW17e1oYGOM8bIY4z0xxp+RTttHwyP57/v7C0IIPaRx4QHFGDeTJoOPqLP6CNKcTL/XSN+Wa9Urv480dvx4jHF1zWNzw0jgg8DXYozLY4wPka5Ue1eTfWo9Qkp20wdY3/8B9I46/Xt8kG31ezi3+a46da6O5Vd/rSa9pu+vKZ9WszzQ+zEsYrrC7wng7VXtUafNDwLXxRg9xvjjvM+7W+lrjHFbjPHuGOPfkI7dp4E/q24/hDBgPTHGDaRJ/oHe9/7tfhJj/HKMcSZwLTC3Wd86gc4w2ttjwD4hhHNIp99Hk65OGXExxp+HEL4LfDWE8CnSVUmfIw2TNfum+vfAvBDCOtI3yV2BOaSJ6GOqtvsF8NEQwoGkSdUXY4yv5vJjQwi3Aq/lIYS/Iw3h/Z/8be5Z0nzJqcD8GOMTDfrzGPDxEMK/kj5kLmeQH4z52+o/AF8IIfyKNNyxG3BSjPGLMcbVIYRvANeEED5PGu7ZnfQhtU+M8UtNmpgaQqjUlP2MFPff5SGp20n/ht8HHBJjvLiw7y+HEL4OXBlCWJ/rnUO6guzZqk0Hej8GLR8zh5KudnqcNP91CjCTN84SN5IuIpgeQngYeDUn/8eAWSGEG/P6i4De3Kfqvh4bQvgD0lntlhjjb2r6MIv0xeDuHOdhpOGlR6rqADglHxu/ijHWDnlCGs5ckF+7ZaQv3seSzlLHki4G+C5pfq+XdNHKA8UvVjsb7UmUHfHBwJPen6iz7RWkfxgvA7cAH8vbTq5XV726c/mbJo5r26vXPjUTiaTLapeRhlg2kD5ovw18t0m8O5H+ET1AulprQ6776Jrt9soxbuHNl3HOIE2qvsabL6t9H2nyczNpWGk16VLRvfL6L1A1iVuz37/lfdaQknBtrGuAy2r2e9MkMGmO4kLSB9pr+X36dtX6MaRhr5/m9RtJY9ofLTg26j2m5W0+SZpD+HWO/R7gLwbZ9/7Lal8gXbn1NeArwEMF70fTY6VOXIeQhgJX5+NnM+ks7ALyBRx5u9mkD+6tvHFZ7SRgBenfwNOkD+xra+LpTwQvMcBltaQzlTtJyaL/kuRLavr5FdLxGWl8We3HSRdKvEoaAv0eKVnsS0qKa/O6PuAaqi6F7+RHyMGLDFo+df8p6Yqcz412f2Ro8iTt5hjjfx7tvkh70pCUFAshfJA03vwj0pVR/5U0DLR49HolrQghvI80RPTvpMn+M0nDKjNHs1/S3pQwZDDGkH5IuD/ph0s/If146qFR7ZW0IgJ/QfoB306kM8XTYoy3jWqvpK1pSEpERIrosloRESnSdEjKzN5DukdLv3eRfnW5NJdPJl2VYe6+2cwC6aZnJ5GuhjjL3R/Idc0hDWkAXOnuS3L5YaRx8N1IV2Vc6O7RzPaq10aTLuuUSUSkNaHRyqYJw90fAw4GMLMxwDrgJtKNze5w96vM7JK8fDFp0qz/pltHAguAI/OH/zzSLRMicL+ZLc8JYAHpsst7SA
ljBuneNwO10VBf32D/awioVCps3Fh8h4C2123xQPfFpHjaW7fFA41j6u3tbbr/YIekjgced/cnST/VX5LLl5B+MEUuX+ru0d1XAWPNbF/gRGClu2/KSWIlMCOv28PdV7l7JJ25VNdVrw0RERlhg71K6gzSfWAAxrt7/3/+8wxv3HhsAukXjv3W5rJG5WvrlDdq403MbC75p/fuTqVS+wPZ5np6elrar111WzzQfTEpnvbWbfHA0GMqThhmtjPpp/yX1q7L8w3bde6gURvuvpD0q1WA2MppZLedfnZbPNB9MSme9tZt8cDIDknNBB5w9/77t6zPw0nkvxty+TrST/n7Tcxljcon1ilv1IaIiIywwSSMj/HGcBSk/31tTn4+h3RPn/7y2WYWzGwasCUPK60AppvZODMbR7rb44q87gUzm5avsJpdU1e9NkREZIQVJQwz2510X/fvVBVfBZxgZj8n3Vu+/46Tt5BuP7yadNOt8wDcfRPpRnr35sfluYy8zaK8z+OkK6QatSEiIiOsG3/pHXVZbffFA90Xk+Jpb90WDxTNYTT8HYZ+6S0iIkWUMEREpIjuVltl27mnNFw/5prlI9QTEZH2ozMMEREpooQhIiJFlDBERKSIEoaIiBRRwhARkSJKGCIiUkQJQ0REiihhiIhIESUMEREpooQhIiJFlDBERKSIEoaIiBRRwhARkSJKGCIiUkQJQ0REiihhiIhIESUMEREpooQhIiJFiv6LVjMbCywC3gtE4M+Bx4AbgMnAGsDcfbOZBWA+cBLwCnCWuz+Q65kDXJarvdLdl+Tyw4DFwG7ALcCF7h7NbK96bQwlYBERaU3pGcZ84DZ3PwA4CHgUuAS4w92nAHfkZYCZwJT8mAssAMgf/vOAI4EjgHlmNi7vswA4t2q/Gbl8oDZERGSENU0YZrYn8EHgWgB3f83dnwdmAUvyZkuAU/PzWcBSd4/uvgoYa2b7AicCK919Uz5LWAnMyOv2cPdV7h6BpTV11WtDRERGWMmQ1H7As8A3zewg4H7gQmC8uz+dt3kGGJ+fTwCeqtp/bS5rVL62TjkN2ngTM5tLOpvB3alUKgVhvVlPT/OXopV6R0tPT09H9bdEt8WkeNpbt8UDQ4+pJGH0AIcCn3b3e8xsPjVDQ3m+IbbciwKN2nD3hcDCvBg3btw46PpLXsRW6h0tlUqlo/pbottiUjztrdvigcYx9fb2Nt2/ZA5jLbDW3e/Jy8tICWR9Hk4i/92Q168DJlXtPzGXNSqfWKecBm2IiMgIa5ow3P0Z4Ckze08uOh54BFgOzMllc4Cb8/PlwGwzC2Y2DdiSh5VWANPNbFye7J4OrMjrXjCzafkKq9k1ddVrQ0RERljRZbXAp4HrzGxn4AngbFKycTM7B3gSsLztLaRLaleTLqs9G8DdN5nZFcC9ebvL3X1Tfn4eb1xWe2t+AFw1QBsiIjLCQozbdephNMS+vr5B71SpVFh/2lENtxlzzfJW+zTidrTx106keNpbt8UDRXMYodH++qW3iIgUUcIQEZEiShgiIlJECUNERIooYYiISBElDBERKaKEISIiRZQwRESkiBKGiIgUUcIQEZEiShgiIlJECUNERIooYYiISBElDBERKaKEISIiRZQwRESkiBKGiIgUUcIQEZEiShgiIlJECUNERIooYYiISBElDBERKdJTspGZrQFeBLYBW939cDPbC7gBmAysAczdN5tZAOYDJwGvAGe5+wO5njnAZbnaK919SS4/DFgM7AbcAlzo7nGgNoYUsYiItGQwZxjHuvvB7n54Xr4EuMPdpwB35GWAmcCU/JgLLADIH/7zgCOBI4B5ZjYu77MAOLdqvxlN2hARkRE2lCGpWcCS/HwJcGpV+VJ3j+6+ChhrZvsCJwIr3X1TPktYCczI6/Zw91XuHoGlNXXVa0NEREZY0ZAUEIHvm1kEvu7uC4Hx7v50Xv8MMD4/nwA8VbXv2lzWqHxtnXIatPEmZjaXdDaDu1
OpVArDekNPT/OXopV6R0tPT09H9bdEt8WkeNpbt8UDQ4+pNGEc7e7rzOztwEoz+2n1yjzfEFvuRYFGbeQEtjAvxo0bNw66/pIXsZV6R0ulUumo/pbotpgUT3vrtnigcUy9vb1N9y8aknL3dfnvBuAm0hzE+jycRP67IW++DphUtfvEXNaofGKdchq0ISIiI6xpwjCz3c3sbf3PgenAT4DlwJy82Rzg5vx8OTDbzIKZTQO25GGlFcB0MxuXJ7unAyvyuhfMbFq+wmp2TV312hARkRFWcoYxHvhXM/sx8EPge+5+G3AVcIKZ/Rz4k7wM6bLYJ4DVwDXAeQDuvgm4Arg3Py7PZeRtFuV9HgduzeUDtSEiIiMsxLhdpx5GQ+zr6xv0TpVKhfWnHdVwmzHXLG+1TyNuRxt/7USKp711WzxQNIcRGu2vX3qLiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSRAlDRESKKGGIiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSRAlDRESKKGGIiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIj2lG5rZGOA+YJ27n2xm+wHXA3sD9wNnuvtrZrYLsBQ4DHgOON3d1+Q6LgXOAbYBn3H3Fbl8BjAfGAMscvercnndNoYctYiIDNpgzjAuBB6tWv4ScLW77w9sJiUC8t/NufzqvB1mNhU4AzgQmAF8zczG5ET0VWAmMBX4WN62URsiIjLCihKGmU0EPgQsyssBOA5YljdZApyan8/Ky+T1x+ftZwHXu/ur7v4LYDVwRH6sdvcn8tnD9cCsJm2IiMgIKx2S+grweeBteXlv4Hl335qX1wIT8vMJwFMA7r7VzLbk7ScAq6rqrN7nqZryI5u08SZmNheYm9ukUqkUhvWGnp7mL0Ur9Y6Wnp6ejupviW6LSfG0t26LB4YeU9NPSTM7Gdjg7veb2TEtt7QduftCYGFejBs3bhx0HSUvYiv1jpZKpdJR/S3RbTEpnvbWbfFA45h6e3ub7l8yJPUB4BQzW0MaLjqONEE91sz6E85EYF1+vg6YBJDX70ma/H69vGafgcqfa9CGiIiMsKYJw90vdfeJ7j6ZNGl9p7t/HLgL+EjebA5wc36+PC+T19/p7jGXn2Fmu+Srn6YAPwTuBaaY2X5mtnNuY3neZ6A2RERkhA3ldxgXAxeZ2WrSfMO1ufxaYO9cfhFwCYC7Pww48AhwG3C+u2/LcxQXACtIV2F53rZRGyIiMsJCjHG0+zDcYl9f36B3qlQqrD/tqIbbjLlmeat9GnE72vhrJ1I87a3b4oGiOYzQaH/90ltERIooYYiISBElDBERKaKEISIiRZQwRESkiBKGiIgUUcIQEZEiShgiIlJECUNERIooYYiISBElDBERKaKEISIiRZQwRESkiBKGiIgUUcIQEZEiShgiIlJECUNERIooYYiISBElDBERKaKEISIiRZQwRESkSE+zDcxsV+BuYJe8/TJ3n2dm+wHXA3sD9wNnuvtrZrYLsBQ4DHgOON3d1+S6LgXOAbYBn3H3Fbl8BjAfGAMscvercnndNoYpdhERGYSSM4xXgePc/SDgYGCGmU0DvgRc7e77A5tJiYD8d3Muvzpvh5lNBc4ADgRmAF8zszFmNgb4KjATmAp8LG9LgzZERGSENU0Y7h7d/aW8+Jb8iMBxwLJcvgQ4NT+flZfJ6483s5DLr3f3V939F8Bq4Ij8WO3uT+Szh+uBWXmfgdoQEZERVjSHkc8EHgQ2ACuBx4Hn3X1r3mQtMCE/nwA8BZDXbyENKb1eXrPPQOV7N2hDRERGWNM5DAB33wYcbGZjgZuAA7ZrrwbJzOYCcwHcnUqlMug6enqavxSt1Dtaenp6Oqq/JbotJsXT3rotHhh6TEUJo5+7P29mdwHvB8aaWU8+A5gIrMubrQMmAWvNrAfYkzT53V/er3qfeuXPNWijtl8LgY
V5MW7cuHEwYQFlyaCVekdLpVLpqP6W6LaYFE9767Z4oHFMvb29TfdvOiRlZvvkMwvMbDfgBOBR4C7gI3mzOcDN+fnyvExef6e7x1x+hpntkq9+mgL8ELgXmGJm+5nZzqSJ8eV5n4HaEBGREVYyh7EvcJeZ/Qfpw32lu/8LcDFwkZmtJs03XJu3vxbYO5dfBFwC4O4PAw48AtwGnO/u2/LZwwXAClIi8rwtDdoQEZERFmKMo92H4Rb7+voGvVOlUmH9aUc13GbMNctb7dOI29FOpzuR4mlv3RYPFA1JhUb765feIiJSRAlDRESKKGGIiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSRAlDRESKKGGIiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSRAlDRESK9Ix2BzrJtnNPGXBdJ/33rSIirdAZhoiIFFHCEBGRIkoYIiJSpOkchplNApYC44EILHT3+Wa2F3ADMBlYA5i7bzazAMwHTgJeAc5y9wdyXXOAy3LVV7r7klx+GLAY2A24BbjQ3eNAbQw5ahERGbSSM4ytwOfcfSowDTjfzKYClwB3uPsU4I68DDATmJIfc4EFAPnDfx5wJHAEMM/MxuV9FgDnVu03I5cP1IaIiIywpgnD3Z/uP0Nw9xeBR4EJwCxgSd5sCXBqfj4LWOru0d1XAWPNbF/gRGClu2/KZwkrgRl53R7uvsrdI+lsprquem2IiMgIG9RltWY2GTgEuAcY7+5P51XPkIasICWTp6p2W5vLGpWvrVNOgzZq+zWXdDaDu1OpVAYTFgA9PUO7wriVNrennp6etuvTUHVbTIqnvXVbPDD0mIo/Jc3srcCNwGfd/QUze31dnm+ILfeiQKM23H0hsDAvxo0bNw66/qEeGK20uT1VKpW269NQdVtMiqe9dVs80Dim3t7epvsXXSVlZm8hJYvr3P07uXh9Hk4i/92Qy9cBk6p2n5jLGpVPrFPeqA0RERlhTRNGvurpWuBRd/9y1arlwJz8fA5wc1X5bDMLZjYN2JKHlVYA081sXJ7sng6syOteMLNpua3ZNXXVa0NEREZYyZDUB4AzgYfM7MFc9pfAVYCb2TnAk0D/GNUtpEtqV5Muqz0bwN03mdkVwL15u8vdfVN+fh5vXFZ7a37QoA0RERlhIcbtOvUwGmJfX9+gd6pUKqw/7aiWG223e0ntaOOvnUjxtLduiweK5jBCo/31S28RESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSRAlDRESKKGGIiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSRAlDRESKKGGIiEgRJQwRESmihCEiIkWUMEREpIgShoiIFFHCEBGRIkoYIiJSpKfZBmb2DeBkYIO7vzeX7QXcAEwG1gDm7pvNLADzgZOAV4Cz3P2BvM8c4LJc7ZXuviSXHwYsBnYDbgEudPc4UBtDjlhERFpScoaxGJhRU3YJcIe7TwHuyMsAM4Ep+TEXWACvJ5h5wJHAEcA8MxuX91kAnFu134wmbYiIyChomjDc/W5gU03xLGBJfr4EOLWqfKm7R3dfBYw1s32BE4GV7r4pnyWsBGbkdXu4+yp3j8DSmrrqtSEiIqOg6ZDUAMa7+9P5+TPA+Px8AvBU1XZrc1mj8rV1yhu18TvMbC7pjAZ3p1KpDDYeenpafSmSVtrcnnp6etquT0PVbTEpnvbWbfHA0GMa2qckkOcb4lDrGUob7r4QWJgX48aNGwfdxlAPjFba3J4qlUrb9Wmoui0mxdPeui0eaBxTb29v0/1bvUpqfR5OIv/dkMvXAZOqtpuYyxqVT6xT3qgNEREZBa0mjOXAnPx8DnBzVflsMwtmNg3YkoeVVgDTzWxcnuyeDqzI614ws2n5CqvZNXXVa0NEREZByWW13wKOASpmtpZ0tdNVgJvZOcCTgOXNbyFdUruadFnt2QDuvs
nMrgDuzdtd7u79E+nn8cZltbfmBw3aEBGRURBi3K7TD6Mh9vX1DXqnSqXC+tOOarnRMdcsb3nf7WFHG3/tRIqnvXVbPFA0hxEa7a9feouISBElDBERKaKEISIiRZQwRESkiBKGiIgUUcIQEZEiQ741iCTbzj2l4fp2u+xWRGSwdIYhIiJFlDBERKSIEoaIiBRRwhARkSJKGCIiUkQJQ0REiihhiIhIESUMEREpooQhIiJFlDBERKSIEoaIiBTRvaRGSKN7Tek+UyLSCXSGISIiRZQwRESkiBKGiIgUafs5DDObAcwHxgCL3P2qUe7SsNP/pSEinaCtzzDMbAzwVWAmMBX4mJlNHd1eiYjsmNr9DOMIYLW7PwFgZtcDs4BHRrVXI6zZGUg96/NfnZ2IyHBp94QxAXiqanktcGTtRmY2F5gL4O709va21Nik793X0n4yclp9b9uV4mlv3RYPDC2mth6SKuXuC939cHc/HAitPMzs/lb3bcdHt8XTjTEpnvZ+dFs8hTE11O4JYx0wqWp5Yi4TEZER1u5DUvcCU8xsP1KiOAP4s9HtkojIjqmtzzDcfStwAbACeDQV+cPbqbmF26ne0dJt8UD3xaR42lu3xQNDjCnEGIerIyIi0sXa+gxDRETahxKGiIgUafdJ7+2uU289YmbfAE4GNrj7e3PZXsANwGRgDWDuvtnMAinGk4BXgLPc/YHR6PdAzGwSsBQYD0RgobvP79SYzGxX4G5gF9K/s2XuPi9fwHE9sDdwP3Cmu79mZruQ4j8MeA443d3XjErnG8h3X7gPWOfuJ3dBPGuAF4FtwFZ3P7xTjzkAMxsLLALeS/p39OfAYwxTPDv0GUaH33pkMTCjpuwS4A53nwLckZchxTclP+YCC0aoj4OxFficu08FpgHn5/eiU2N6FTjO3Q8CDgZmmNk04EvA1e6+P7AZOCdvfw6wOZdfnbdrRxeSLkDp1+nxABzr7gfn33FB5x5zkBLAbe5+AHAQ6b0atnh26IRB1a1H3P010jelWaPcpyLufjewqaZ4FrAkP18CnFpVvtTdo7uvAsaa2b4j09My7v50/7cbd3+RdKBPoENjyv16KS++JT8icBywLJfXxtMf5zLg+PwNsG2Y2UTgQ6RvsOT+ddDDG1YAAAKJSURBVGw8DXTkMWdmewIfBK4FcPfX3P15hjGeHT1h1Lv1yIRR6stwGO/uT+fnz5CGd6DD4jSzycAhwD10cExmNsbMHgQ2ACuBx4Hn8+Xi8OY+vx5PXr+FNMzTTr4CfB74bV7em86OB1IS/76Z3Z9vMQSde8ztBzwLfNPMfmRmi8xsd4Yxnh09YXQtd4+kfwwdxczeCtwIfNbdX6he12kxufs2dz+YdIeCI4ADRrlLLTOz/vmy+0e7L8PsaHc/lDQ8c76ZfbB6ZYcdcz3AocACdz8EeJk3hp+AocezoyeMbrv1yPr+U8r8d0Mu74g4zewtpGRxnbt/Jxd3dEwAeVjgLuD9pNP+/otNqvv8ejx5/Z6kyeJ28QHglDxJfD1pKGo+nRsPAO6+Lv/dANxESuydesytBda6+z15eRkpgQxbPDt6wnj91iNmtjPp1iOdfD/w5cCc/HwOcHNV+WwzC3nidUvVKWpbyOPb1wKPuvuXq1Z1ZExmtk++YgUz2w04gTQvcxfwkbxZbTz9cX4EuDN/G2wL7n6pu09098mkfyd3uvvH6dB4AMxsdzN7W/9zYDrwEzr0mHP3Z4CnzOw9ueh40n8FMWzx7NCX1br7VjPrv/XIGOAb2/HWI8PKzL4FHANUzGwtMA+4CnAzOwd4ErC8+S2kS+dWky6fO3vEO9zcB4AzgYfyuD/AX9K5Me0LLMlX4u1Euq3Nv5jZI8D1ZnYl8CPyBGX++09mtpp0McMZo9HpFlxM58YzHrjJzCB9Fv6zu99mZvfSmcccwKeB6/IX4CdIfdyJYYpHtwYREZEiO/qQlIiIFFLCEBGRIkoYIiJSRAlDRESKKGGIiE
gRJQwRESmihCEiIkX+P0I4LMWFMdoKAAAAAElFTkSuQmCC\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": [],
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iVbCFLZJnSqH",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "1a8f5dcf-eac8-4236-f391-f5a7ac6c05d4"
      },
      "source": [
        "pd.options.display.max_colwidth = 500\n",
        "# Layout of the test set: Context is the question being asked, Ground Truth Utterance is the correct response, and Distractor_0 to Distractor_8 are the 9 negative samples\n",
        "test_df.head()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Context</th>\n",
              "      <th>Ground Truth Utterance</th>\n",
              "      <th>Distractor_0</th>\n",
              "      <th>Distractor_1</th>\n",
              "      <th>Distractor_2</th>\n",
              "      <th>Distractor_3</th>\n",
              "      <th>Distractor_4</th>\n",
              "      <th>Distractor_5</th>\n",
              "      <th>Distractor_6</th>\n",
              "      <th>Distractor_7</th>\n",
              "      <th>Distractor_8</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>anyone knows why my stock oneiric exports env var 'USERNAME'?  I mean what is that used for?  I know of $USER but not $USERNAME .  My precise install doesn't export USERNAME __eou__ __eot__ looks like it used to be exported by lightdm, but the line had the comment \"// FIXME: Is this required?\" so I guess it isn't surprising it is gone __eou__ __eot__ thanks!  How the heck did you figure that out? __eou__ __eot__ https://bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__</td>\n",
              "      <td>nice thanks! __eou__</td>\n",
              "      <td>wrong channel for it, but check efnet.org, unofficial page. __eou__</td>\n",
              "      <td>every time the kernel changes, you will lose video __eou__ yep __eou__</td>\n",
              "      <td>ok __eou__</td>\n",
              "      <td>!nomodeset &gt; acer __eou__ I'm assuming it is a driver issue. __eou__ !pm &gt; acer __eou__ i DON'T pm. ;) __eou__ OOPS SORRY FOR THE CAPS __eou__</td>\n",
              "      <td>http://www.ubuntu.com/project/about-ubuntu/derivatives  (some call them derivatives, others call them flavors, same difference) __eou__</td>\n",
              "      <td>thx __eou__ unfortunately the program isn't installed from the repositories __eou__</td>\n",
              "      <td>how can I check? By doing a recovery for testing? __eou__</td>\n",
              "      <td>my humble apologies __eou__</td>\n",
              "      <td>#ubuntu-offtopic __eou__</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>i set up my hd such that i have to type a passphrase to access it at boot. how can i remove that passwrd, and just boot up normal. i did this at install, it works fine, just tired of having reboots where i need to be at terminal to type passwd in. help? __eou__ __eot__ backup your data, and re-install without encryption \"might\" be the easiest method __eou__ __eot__</td>\n",
              "      <td>so you dont know, ok, anyone else? __eou__ you are like, yah my mouse doesnt work, reinstall your os lolol what a joke __eou__</td>\n",
              "      <td>nmap is nice, but it wasn't what I was looking for.  I finally found it again: mtr (my traceroute) is what I was looking for.  I'll be keeping nmap handy though. __eou__</td>\n",
              "      <td>ok __eou__</td>\n",
              "      <td>cdrom worked fine on windows. __eou__ i dont think it has anything to do with the buring process, cds work fine on my desktop and my other ubuntu lap __eou__</td>\n",
              "      <td>ah yes, i have read return as rerun __eou__</td>\n",
              "      <td>hm? __eou__</td>\n",
              "      <td>not the case, LTS is every other .04 release. The .04 isn't always more stable __eou__ I would reinstall with Precise __eou__ you can restore user data and such from backup __eou__</td>\n",
              "      <td>Pretty much __eou__</td>\n",
              "      <td>I used the one I downloaded from AMD __eou__</td>\n",
              "      <td>ffmpeg is part of the package , quixotedon , at least I'm quite sure it still is __eou__ if not just install ffmpeg __eou__</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>im trying to use ubuntu on my macbook pro retina __eou__ i read in the forums that ubuntu has a apple version now? __eou__ __eot__  not that ive ever heard of..  normal ubutnu should work on an intel based mac. there is the PPC version also. __eou__  you want total control? or what are you wanting exactly? __eou__ __eot__</td>\n",
              "      <td>just wondering how it runs __eou__</td>\n",
              "      <td>yes, that's what I did, exported it to a \"id_dsa\" file, then back to Ubuntu copied it into ~/.ssh/ __eou__</td>\n",
              "      <td>nothing - i am talking about the question of myhero __eou__</td>\n",
              "      <td>that should fix the fonts being too large __eou__</td>\n",
              "      <td>okay, so hcitool echos back hci0 &lt;mac address of controller&gt; but the bluetooth devices panel keeps disconnecting and reconnecting the device (or so it seems) any idea why that would be? __eou__</td>\n",
              "      <td>I get to the menu with options such as 'try ubuntu', 'install ubuntu', 'check disc' __eou__</td>\n",
              "      <td>why do u need analyzer __eou__ it is a toy __eou__ ok msp301 __eou__ but y, i mean it is the same ubunut, only with older programs __eou__ ubuntu 804 or 1204 __eou__ no i dont use 804 __eou__ i am asking hypo qs __eou__</td>\n",
              "      <td>Cntrl-C may stop the command but it doesn't fix my HDD problem. __eou__</td>\n",
              "      <td>if you're only going to run Ubuntu, just get a normal PC rather than a mac __eou__ that said, I'm running it on a macbook, because I got one relatively cheaply __eou__</td>\n",
              "      <td>the ones which are not picked up at the moment are on STDERR and not STDOUT and &gt; is only covering STDOUT __eou__</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>no suggestions? __eou__ links? __eou__ how can i remove luks passphrase at boot. i dont want to use feature anymore... __eou__ __eot__ you may need to create a new volume __eou__ __eot__ that leads me to the next question lol... i dont know how to create new volumes exactly in cmdline, usually i use a gui. im just trying to access this server via usb loaded with next os im going to load, the luks pw is stopping me __eou__ __eot__ for something like that I would likely use something like a li...</td>\n",
              "      <td>you cant load anything via usb or cd when luks is running __eou__ it wont allow usb boot, i tried with 2 diff usb drives __eou__</td>\n",
              "      <td>-p  sorry... __eou__  nmap -p22 __eou__ It doesn't say:  22/tcp open  ssh  ? __eou__</td>\n",
              "      <td>i guess so i can't even launch it. __eou__</td>\n",
              "      <td>noted __eou__</td>\n",
              "      <td>rxvt-unicode is one __eou__</td>\n",
              "      <td>I tarred all of ~ __eou__</td>\n",
              "      <td>I tarred all of ~ __eou__</td>\n",
              "      <td>I don't really know if I can help, but I was curious. lol __eou__ That's cool. I'll look into it. Now, we better stop talking about this since it's offtopic. :P __eou__</td>\n",
              "      <td>that works just fine, thanks! __eou__</td>\n",
              "      <td>thank you __eou__</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>I just added a second usb printer but not sure what the uri should read - can anyone help with usb printers? __eou__ __eot__ firefox localhost:631 __eou__ __eot__ firefox? __eou__ __eot__ yes __eou__ firefox localhost:631 __eou__ firefox http://localhost:631 __eou__ cups has a web based interface __eou__ __eot__</td>\n",
              "      <td>i was setting it up under the printer configuration __eou__ thanks! __eou__</td>\n",
              "      <td>i'd say the most commonly venue would be via Launchpad. check out the factoid !bug as well __eou__</td>\n",
              "      <td>the old hardy man page, http://manpages.ubuntu.com/manpages/hardy/man1/gcalctool.1.html says \"delete\" clears the screen, but it doesn't __eou__ because LTS are good __eou__</td>\n",
              "      <td>i'll give a try __eou__</td>\n",
              "      <td>by the way, the url you posted for davfs is from dapper... that's 5.xx iirc __eou__</td>\n",
              "      <td>http://ubuntuforums.org/showthread.php?t=1549847 __eou__</td>\n",
              "      <td>So I load up putty gui, then what do I do? __eou__</td>\n",
              "      <td>you should read error messages, it says 'are you root?' __eou__</td>\n",
              "      <td>waiting the college semester to close just to make sure I will not need to reconfigure my environment again __eou__</td>\n",
              "      <td>I was calling myself a jerk. All I know is that you downloaded a game successfully. __eou__</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Context  ...                                                                                                                 Distractor_8\n",
              "0           anyone knows why my stock oneiric exports env var 'USERNAME'?  I mean what is that used for?  I know of $USER but not $USERNAME .  My precise install doesn't export USERNAME __eou__ __eot__ looks like it used to be exported by lightdm, but the line had the comment \"// FIXME: Is this required?\" so I guess it isn't surprising it is gone __eou__ __eot__ thanks!  How the heck did you figure that out? __eou__ __eot__ https://bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__   ...                                                                                                     #ubuntu-offtopic __eou__\n",
              "1                                                                                                                                     i set up my hd such that i have to type a passphrase to access it at boot. how can i remove that passwrd, and just boot up normal. i did this at install, it works fine, just tired of having reboots where i need to be at terminal to type passwd in. help? __eou__ __eot__ backup your data, and re-install without encryption \"might\" be the easiest method __eou__ __eot__   ...  ffmpeg is part of the package , quixotedon , at least I'm quite sure it still is __eou__ if not just install ffmpeg __eou__\n",
              "2                                                                                                                                                                                 im trying to use ubuntu on my macbook pro retina __eou__ i read in the forums that ubuntu has a apple version now? __eou__ __eot__  not that ive ever heard of..  normal ubutnu should work on an intel based mac. there is the PPC version also. __eou__  you want total control? or what are you wanting exactly? __eou__ __eot__   ...            the ones which are not picked up at the moment are on STDERR and not STDOUT and > is only covering STDOUT __eou__\n",
              "3  no suggestions? __eou__ links? __eou__ how can i remove luks passphrase at boot. i dont want to use feature anymore... __eou__ __eot__ you may need to create a new volume __eou__ __eot__ that leads me to the next question lol... i dont know how to create new volumes exactly in cmdline, usually i use a gui. im just trying to access this server via usb loaded with next os im going to load, the luks pw is stopping me __eou__ __eot__ for something like that I would likely use something like a li...  ...                                                                                                            thank you __eou__\n",
              "4                                                                                                                                                                                           I just added a second usb printer but not sure what the uri should read - can anyone help with usb printers? __eou__ __eot__ firefox localhost:631 __eou__ __eot__ firefox? __eou__ __eot__ yes __eou__ firefox localhost:631 __eou__ firefox http://localhost:631 __eou__ cups has a web based interface __eou__ __eot__   ...                                  I was calling myself a jerk. All I know is that you downloaded a game successfully. __eou__\n",
              "\n",
              "[5 rows x 11 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 9
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rHJLZSRrnSnz",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "540479a9-361a-457e-de0c-a3be0e96ddb3"
      },
      "source": [
        "test_df.describe()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Context</th>\n",
              "      <th>Ground Truth Utterance</th>\n",
              "      <th>Distractor_0</th>\n",
              "      <th>Distractor_1</th>\n",
              "      <th>Distractor_2</th>\n",
              "      <th>Distractor_3</th>\n",
              "      <th>Distractor_4</th>\n",
              "      <th>Distractor_5</th>\n",
              "      <th>Distractor_6</th>\n",
              "      <th>Distractor_7</th>\n",
              "      <th>Distractor_8</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>count</th>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "      <td>18920</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>unique</th>\n",
              "      <td>18920</td>\n",
              "      <td>18026</td>\n",
              "      <td>14066</td>\n",
              "      <td>13998</td>\n",
              "      <td>14162</td>\n",
              "      <td>14116</td>\n",
              "      <td>14200</td>\n",
              "      <td>14149</td>\n",
              "      <td>14064</td>\n",
              "      <td>14068</td>\n",
              "      <td>14202</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>top</th>\n",
              "      <td>howto fix a corrupt square mouse pointer every cold boot in precise?(ati grafix card) __eou__ __eot__ were you here yesterday or so with same problem ? __eou__ did you try restarting x ? __eou__ sudo service login-manager restart __eou__ probably not __eou__ however if restrting x makes it go away then there may be a setting in your x config to fix it __eou__ __eot__ whats up mate __eou__ i got the square mouse away, but compiz doesnt run anymore __eou__ __eot__ nada raping a ftp server with...</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "      <td>thanks __eou__</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>freq</th>\n",
              "      <td>1</td>\n",
              "      <td>152</td>\n",
              "      <td>133</td>\n",
              "      <td>142</td>\n",
              "      <td>163</td>\n",
              "      <td>148</td>\n",
              "      <td>130</td>\n",
              "      <td>164</td>\n",
              "      <td>155</td>\n",
              "      <td>151</td>\n",
              "      <td>156</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Context  ...    Distractor_8\n",
              "count                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 18920  ...           18920\n",
              "unique                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                18920  ...           14202\n",
              "top     howto fix a corrupt square mouse pointer every cold boot in precise?(ati grafix card) __eou__ __eot__ were you here yesterday or so with same problem ? __eou__ did you try restarting x ? __eou__ sudo service login-manager restart __eou__ probably not __eou__ however if restrting x makes it go away then there may be a setting in your x config to fix it __eou__ __eot__ whats up mate __eou__ i got the square mouse away, but compiz doesnt run anymore __eou__ __eot__ nada raping a ftp server with...  ...  thanks __eou__\n",
              "freq                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1  ...             156\n",
              "\n",
              "[4 rows x 11 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9oYfdJtjPKvF",
        "colab_type": "text"
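      },
      "source": [
        "### Baseline model: random guess\n",
        "* Check whether the correct label appears among the top K candidates in y; if it does, increment the count. The final score is the count divided by the number of examples (recall@K)."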
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jfUH0F13nSlR",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.feature_extraction.text import TfidfTransformer"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Tlh56AEinSi8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Load Data\n",
        "train_df = pd.read_csv(\"/content/drive/My Drive/chatBot/train (1).csv\")\n",
        "test_df = pd.read_csv(\"/content/drive/My Drive/chatBot/test.csv\")\n",
        "validation_df = pd.read_csv(\"/content/drive/My Drive/chatBot/valid.csv\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "D65i_UOGnSg1",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "d77afbdf-9d52-449d-ec36-df169b755ebd"
      },
      "source": [
        "y_test = np.zeros(len(test_df))\n",
        "y_test"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([0., 0., 0., ..., 0., 0., 0.])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XCST6E8enSWI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def evaluate_recall(y, y_test, k=1):\n",
        "    # y: ranked candidate indices per example; y_test: index of the true response\n",
        "    num_examples = float(len(y))\n",
        "    num_correct = 0\n",
        "    for predictions, label in zip(y, y_test):\n",
        "        # Count a hit when the true label is among the top-k candidates\n",
        "        if label in predictions[:k]:\n",
        "            num_correct += 1\n",
        "    return num_correct / num_examples"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "etl9MeqGnSTt",
        "colab_type": "code",
        "colab": {}
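      },
      "source": [
        "def predict_random(context, utterances):\n",
        "    # Return a random permutation of the candidate indices (10 candidates per context in the test set)\n",
        "    return np.random.choice(len(utterances), len(utterances), replace=False)"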
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dVGTSfTHnSRr",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 85
        },
        "outputId": "b992000c-d82c-4835-f81b-cb3d9dad9321"
      },
      "source": [
        "# Evaluate Random predictor\n",
        "y_random = [predict_random(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]\n",
        "for n in [1, 2, 5, 10]:\n",
        "    print(\"Recall @ ({}, 10): {:g}\".format(n, evaluate_recall(y_random, y_test, n)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Recall @ (1, 10): 0.0986786\n",
            "Recall @ (2, 10): 0.199313\n",
            "Recall @ (5, 10): 0.497093\n",
            "Recall @ (10, 10): 1\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J9vr8l98AKwl",
        "colab_type": "text"
      },
      "source": [
        "### Baseline model: TF-IDF retrieval"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eGChjkbzQAfL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.feature_extraction.text import TfidfTransformer\n",
        "\n",
        "class TFIDFPredictor:\n",
        "    def __init__(self):\n",
        "        self.vectorizer = TfidfVectorizer()\n",
        "\n",
        "    def train(self, data):\n",
        "        self.vectorizer.fit(np.append(data.Context.values,data.Utterance.values))\n",
        "\n",
        "    def predict(self, context, utterances):\n",
        "        # Convert context and utterances into tfidf vector\n",
        "        vector_context = self.vectorizer.transform([context])\n",
        "        vector_doc = self.vectorizer.transform(utterances)\n",
        "        # The dot product measures the similarity of the resulting vectors\n",
        "        result = np.dot(vector_doc, vector_context.T).todense()\n",
        "        result = np.asarray(result).flatten()\n",
        "        # Sort by top results and return the indices in descending order\n",
        "        return np.argsort(result, axis=0)[::-1]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UyXrIJReQAiG",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 85
        },
        "outputId": "5664936c-a10a-4c22-f475-d9bc964bff91"
      },
      "source": [
        "# Evaluate TFIDF predictor\n",
        "pred = TFIDFPredictor()\n",
        "pred.train(train_df)\n",
        "y = [pred.predict(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]\n",
        "for n in [1, 2, 5, 10]:\n",
        "    print(\"Recall @ ({}, 10): {:g}\".format(n, evaluate_recall(y, y_test, n)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Recall @ (1, 10): 0.485624\n",
            "Recall @ (2, 10): 0.586681\n",
            "Recall @ (5, 10): 0.762474\n",
            "Recall @ (10, 10): 1\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y3h4SRxX9KMm",
        "colab_type": "text"
      },
      "source": [
        "### Example\n",
        "* Understanding CountVectorizer, TfidfTransformer, and TfidfVectorizer in sklearn\n",
        "* https://blog.csdn.net/m0_37324740/article/details/79411651\n",
        "* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CGrigO29QAc2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.feature_extraction.text import CountVectorizer\n",
        "\n",
        "vectorizer = CountVectorizer(min_df=1)\n",
        "\n",
        "corpus = [\n",
        "    'This is the first document.',\n",
        "    'This is the second second document.',\n",
        "    'And the third one.',\n",
        "    'Is this the first document?',\n",
        "]\n",
        "X = vectorizer.fit_transform(corpus)\n",
        "# Note: scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2\n",
        "feature_name = vectorizer.get_feature_names()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oIdYqzqgnSPr",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 493
        },
        "outputId": "da630953-870f-4539-e4df-43e30009d712"
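      },
      "source": [
        "print(feature_name)  # all vocabulary terms learned by the bag-of-words model\n",
        "print(\"\\n\")\n",
        "print(X.toarray())  # dense term-frequency matrix\n",
        "print(\"\\n\")\n",
        "print(X)  # sparse form: (row, column) count; each occurrence adds 1, zero entries are not stored"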
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n",
            "\n",
            "\n",
            "[[0 1 1 1 0 0 1 0 1]\n",
            " [0 1 0 1 0 2 1 0 1]\n",
            " [1 0 0 0 1 0 1 1 0]\n",
            " [0 1 1 1 0 0 1 0 1]]\n",
            "\n",
            "\n",
            "  (0, 8)\t1\n",
            "  (0, 3)\t1\n",
            "  (0, 6)\t1\n",
            "  (0, 2)\t1\n",
            "  (0, 1)\t1\n",
            "  (1, 8)\t1\n",
            "  (1, 3)\t1\n",
            "  (1, 6)\t1\n",
            "  (1, 1)\t1\n",
            "  (1, 5)\t2\n",
            "  (2, 6)\t1\n",
            "  (2, 0)\t1\n",
            "  (2, 7)\t1\n",
            "  (2, 4)\t1\n",
            "  (3, 8)\t1\n",
            "  (3, 3)\t1\n",
            "  (3, 6)\t1\n",
            "  (3, 2)\t1\n",
            "  (3, 1)\t1\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yJ8yGLDr8l9U",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 340
        },
        "outputId": "653b4535-f7af-4e4a-cff5-bd57585d6d32"
      },
      "source": [
        "from sklearn.feature_extraction.text import TfidfTransformer\n",
        "from sklearn.feature_extraction.text import CountVectorizer\n",
        "\n",
        "corpus = [\n",
        "    'This is the first document.',\n",
        "    'This is the second second document.',\n",
        "    'And the third one.',\n",
        "    'Is this the first document?',\n",
        "]\n",
        "\n",
        "vectorizer = CountVectorizer()\n",
        "\n",
        "transformer = TfidfTransformer()\n",
        "tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))\n",
        "print(tfidf)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "  (0, 8)\t0.4387767428592343\n",
            "  (0, 6)\t0.35872873824808993\n",
            "  (0, 3)\t0.4387767428592343\n",
            "  (0, 2)\t0.5419765697264572\n",
            "  (0, 1)\t0.4387767428592343\n",
            "  (1, 8)\t0.2723014675233404\n",
            "  (1, 6)\t0.22262429232510395\n",
            "  (1, 5)\t0.8532257361452786\n",
            "  (1, 3)\t0.2723014675233404\n",
            "  (1, 1)\t0.2723014675233404\n",
            "  (2, 7)\t0.5528053199908667\n",
            "  (2, 6)\t0.2884767487500274\n",
            "  (2, 4)\t0.5528053199908667\n",
            "  (2, 0)\t0.5528053199908667\n",
            "  (3, 8)\t0.4387767428592343\n",
            "  (3, 6)\t0.35872873824808993\n",
            "  (3, 3)\t0.4387767428592343\n",
            "  (3, 2)\t0.5419765697264572\n",
            "  (3, 1)\t0.4387767428592343\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ht4ssr0n8mu2",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 340
        },
        "outputId": "b0943cf8-96a6-4c11-ef05-2578a4cdd66f"
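      },
      "source": [
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step\n",
        "tfidf2 = TfidfVectorizer()\n",
        "tfidf_matrix = tfidf2.fit_transform(corpus)  # renamed from `re` to avoid shadowing the stdlib module\n",
        "print(tfidf_matrix)"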
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "  (0, 1)\t0.4387767428592343\n",
            "  (0, 2)\t0.5419765697264572\n",
            "  (0, 6)\t0.35872873824808993\n",
            "  (0, 3)\t0.4387767428592343\n",
            "  (0, 8)\t0.4387767428592343\n",
            "  (1, 5)\t0.8532257361452786\n",
            "  (1, 1)\t0.2723014675233404\n",
            "  (1, 6)\t0.22262429232510395\n",
            "  (1, 3)\t0.2723014675233404\n",
            "  (1, 8)\t0.2723014675233404\n",
            "  (2, 4)\t0.5528053199908667\n",
            "  (2, 7)\t0.5528053199908667\n",
            "  (2, 0)\t0.5528053199908667\n",
            "  (2, 6)\t0.2884767487500274\n",
            "  (3, 1)\t0.4387767428592343\n",
            "  (3, 2)\t0.5419765697264572\n",
            "  (3, 6)\t0.35872873824808993\n",
            "  (3, 3)\t0.4387767428592343\n",
            "  (3, 8)\t0.4387767428592343\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uH8nWPOqxPc4",
        "colab_type": "text"
      },
      "source": [
        "### Neural network modeling"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AEedxQZOyu87",
        "colab_type": "text"
      },
      "source": [
        "Reference code files"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KbGogTc8abvf",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "d0ee8c1e-0d4b-417f-dc9b-5102fd6058d9"
      },
      "source": [
        "cd /content/drive/My Drive"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/content/drive/My Drive\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hBf_F45vadH_",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        },
        "outputId": "b149d257-068c-4f9f-8828-cc8fe48ac0c5"
      },
      "source": [
        "ls"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            " \u001b[0m\u001b[01;34mbot\u001b[0m/       \u001b[01;34mchatbot-retrieval\u001b[0m/   \u001b[01;34mdatasets\u001b[0m/   \u001b[01;34mtemp\u001b[0m/   \u001b[01;34mxiamen\u001b[0m/\n",
            " \u001b[01;34mchatBot\u001b[0m/  \u001b[01;34m'Colab Notebooks'\u001b[0m/    \u001b[01;34mHUAWEI\u001b[0m/     \u001b[01;34mVQA\u001b[0m/    入职文件.zip\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eZrRXXKkUZgD",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 136
        },
        "outputId": "fc18d6b3-46b0-44b0-9120-301c5a2a5efe"
      },
      "source": [
        "#!git clone https://github.com/dennybritz/chatbot-retrieval"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Cloning into 'chatbot-retrieval'...\n",
            "remote: Enumerating objects: 3, done.\u001b[K\n",
            "remote: Counting objects: 100% (3/3), done.\u001b[K\n",
            "remote: Compressing objects: 100% (3/3), done.\u001b[K\n",
            "remote: Total 392 (delta 0), reused 0 (delta 0), pack-reused 389\u001b[K\n",
            "Receiving objects: 100% (392/392), 19.63 MiB | 7.40 MiB/s, done.\n",
            "Resolving deltas: 100% (225/225), done.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nwSfKXWSaSGG",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "0f7703e7-eede-4757-ede7-e1603a0b0c3e"
      },
      "source": [
        "cd chatbot-retrieval/"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/content/drive/My Drive/chatbot-retrieval\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "85ymeWTWUsP8",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 68
        },
        "outputId": "75d4c943-a573-460e-ac17-912f873273b1"
      },
      "source": [
        "ls"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\u001b[0m\u001b[01;34mdata\u001b[0m/    \u001b[01;34mnotebooks\u001b[0m/    requirements.txt  udc_inputs.py   udc_predict.py\n",
            "LICENSE  \u001b[01;34m__pycache__\u001b[0m/  \u001b[01;34mscripts\u001b[0m/          udc_metrics.py  udc_test.py\n",
            "\u001b[01;34mmodels\u001b[0m/  README.md     udc_hparams.py    udc_model.py    udc_train.py\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cDMVEvHVTQSS",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "5ed853e0-75be-404d-eaf7-1d268eea9a74"
      },
      "source": [
        "!python udc_train.py"
      ],
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "WARNING:tensorflow:From udc_train.py:28: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.\n",
            "\n",
            "WARNING:tensorflow:From udc_train.py:64: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.\n",
            "\n",
            "WARNING:tensorflow:\n",
            "The TensorFlow contrib module will not be included in TensorFlow 2.0.\n",
            "For more information, please see:\n",
            "  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n",
            "  * https://github.com/tensorflow/addons\n",
            "  * https://github.com/tensorflow/io (for I/O related ops)\n",
            "If you depend on functionality not listed there, please file an issue.\n",
            "\n",
            "W0828 09:46:21.833918 139745635493760 lazy_loader.py:50] \n",
            "The TensorFlow contrib module will not be included in TensorFlow 2.0.\n",
            "For more information, please see:\n",
            "  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n",
            "  * https://github.com/tensorflow/addons\n",
            "  * https://github.com/tensorflow/io (for I/O related ops)\n",
            "If you depend on functionality not listed there, please file an issue.\n",
            "\n",
            "WARNING:tensorflow:From udc_train.py:40: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.\n",
            "W0828 09:46:21.834181 139745635493760 deprecation.py:323] From udc_train.py:40: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.\n",
            "WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/estimators/estimator.py:1180: BaseEstimator.__init__ (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimator.*\n",
            "W0828 09:46:21.834500 139745635493760 deprecation.py:323] From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/estimators/estimator.py:1180: BaseEstimator.__init__ (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimator.*\n",
            "INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f18a4300080>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {\n",
            "  per_process_gpu_memory_fraction: 1.0\n",
            "}\n",
            ", '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/content/drive/My Drive/chatbot-retrieval/runs/1598607981', '_session_creation_timeout_secs': 7200}\n",
            "I0828 09:46:21.834816 139745635493760 estimator.py:456] Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f18a4300080>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {\n",
            "  per_process_gpu_memory_fraction: 1.0\n",
            "}\n",
            ", '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/content/drive/My Drive/chatbot-retrieval/runs/1598607981', '_session_creation_timeout_secs': 7200}\n",
            "WARNING:tensorflow:From /content/drive/My Drive/chatbot-retrieval/udc_metrics.py:11: MetricSpec.__init__ (from tensorflow.contrib.learn.python.learn.metric_spec) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.estimator.EstimatorSpec.eval_metric_ops.\n",
            "W0828 09:46:21.835674 139745635493760 deprecation.py:323] From /content/drive/My Drive/chatbot-retrieval/udc_metrics.py:11: MetricSpec.__init__ (from tensorflow.contrib.learn.python.learn.metric_spec) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.estimator.EstimatorSpec.eval_metric_ops.\n",
            "WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/monitors.py:279: BaseMonitor.__init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.\n",
            "Instructions for updating:\n",
            "Monitors are deprecated. Please use tf.train.SessionRunHook.\n",
            "W0828 09:46:21.836106 139745635493760 deprecation.py:323] From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/monitors.py:279: BaseMonitor.__init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.\n",
            "Instructions for updating:\n",
            "Monitors are deprecated. Please use tf.train.SessionRunHook.\n",
            "WARNING:tensorflow:From /content/drive/My Drive/chatbot-retrieval/udc_inputs.py:46: read_batch_features (from tensorflow.contrib.learn.python.learn.learn_io.graph_io) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.data.\n",
            "W0828 09:46:21.841871 139745635493760 deprecation.py:323] From /content/drive/My Drive/chatbot-retrieval/udc_inputs.py:46: read_batch_features (from tensorflow.contrib.learn.python.learn.learn_io.graph_io) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.data.\n",
            "WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py:833: read_keyed_batch_features (from tensorflow.contrib.learn.python.learn.learn_io.graph_io) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.data.\n",
            "W0828 09:46:21.842051 139745635493760 deprecation.py:323] From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py:833: read_keyed_batch_features (from tensorflow.contrib.learn.python.learn.learn_io.graph_io) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.data.\n",
            "WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py:542: read_keyed_batch_examples (from tensorflow.contrib.learn.python.learn.learn_io.graph_io) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.data.\n",
            "W0828 09:46:21.842256 139745635493760 deprecation.py:323] From /tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py:542: read_keyed_batch_examples (from tensorflow.contrib.learn.python.learn.learn_io.graph_io) is deprecated and will be removed in a future version.\n",
            "Instructions for updating:\n",
            "Use tf.data.\n",
            "Traceback (most recent call last):\n",
            "  File \"udc_train.py\", line 64, in <module>\n",
            "    tf.app.run()\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py\", line 40, in run\n",
            "    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)\n",
            "  File \"/usr/local/lib/python3.6/dist-packages/absl/app.py\", line 299, in run\n",
            "    _run_main(main, args)\n",
            "  File \"/usr/local/lib/python3.6/dist-packages/absl/app.py\", line 250, in _run_main\n",
            "    sys.exit(main(argv))\n",
            "  File \"udc_train.py\", line 61, in main\n",
            "    estimator.fit(input_fn=input_fn_train, steps=None, monitors=[eval_monitor])\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py\", line 507, in new_func\n",
            "    return func(*args, **kwargs)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/estimators/estimator.py\", line 524, in fit\n",
            "    loss = self._train_model(input_fn=input_fn, hooks=hooks)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/estimators/estimator.py\", line 1038, in _train_model\n",
            "    features, labels = input_fn()\n",
            "  File \"/content/drive/My Drive/chatbot-retrieval/udc_inputs.py\", line 46, in input_fn\n",
            "    name=\"read_batch_features_{}\".format(mode))\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py\", line 324, in new_func\n",
            "    return func(*args, **kwargs)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py\", line 833, in read_batch_features\n",
            "    name=name)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py\", line 324, in new_func\n",
            "    return func(*args, **kwargs)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py\", line 542, in read_keyed_batch_features\n",
            "    name=scope)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py\", line 324, in new_func\n",
            "    return func(*args, **kwargs)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py\", line 183, in read_keyed_batch_examples\n",
            "    seed=seed)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py\", line 386, in _read_keyed_batch_examples_helper\n",
            "    file_names = _get_file_names(file_pattern, randomize_input)\n",
            "  File \"/tensorflow-1.15.2/python3.6/tensorflow_core/contrib/learn/python/learn/learn_io/graph_io.py\", line 282, in _get_file_names\n",
            "    raise ValueError('No files match %s.' % file_pattern)\n",
            "ValueError: No files match ['/content/drive/My Drive/chatbot-retrieval/data/train.tfrecords'].\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Tmu9IRyeWaEc",
        "colab_type": "code",
        "cellView": "both",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "b6643694-b705-4740-dc0d-8f9fb0710ad4"
      },
      "source": [
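        "# Select TensorFlow 1.x in Colab. The magic switches only the major version (1.x or 2.x)\n",
        "# and has to run before TensorFlow is first imported in the session.\n",
        "%tensorflow_version 1.x\n",
        "import tensorflow as tf\n",
        "tf.__version__  # the udc_* scripts rely on tf.contrib, which exists only in TF 1.x"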
      ],
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'1.15.2'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zfVTgCHXeH4Q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
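        "# Evaluate a trained model on the test set; --model_dir is the runs/<timestamp> directory written by udc_train.py\n",
        "!python udc_test.py --model_dir=..."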
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "k8XnlMcneKYl",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
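        "# Score candidate responses for a context with the trained model; use the same --model_dir as above\n",
        "!python udc_predict.py --model_dir=..."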
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}